Abstract — We consider the problem of optimizing time averages
in systems with independent and identically distributed
behavior over renewal frames. This includes scheduling and 
task processing to maximize utility in stochastic networks with 
variable length scheduling modes. Every frame, a new policy is 
implemented that affects the frame size and that creates a vector 
of attributes. An algorithm is developed for choosing policies 
on each frame in order to maximize a concave function of the 
time average attribute vector, subject to additional time average 
constraints. The algorithm is based on Lyapunov optimization 
concepts and involves minimizing a "drift-plus-penalty" ratio 
over each frame. The algorithm can learn efficient behavior 
without a-priori statistical knowledge by sampling from the 
past. Our framework is applicable to a large class of problems, 
including Markov decision problems. 



I. Introduction 

Consider a stochastic system that regularly experiences 
times when the system state is refreshed, called renewal times. 
The goal is to develop a control algorithm that maximizes the 
time average of a reward process associated with the system, 
subject to time average constraints on a collection of penalty 
processes. The renewal-reward theorem is a simple and elegant 
technique for computing time averages in such systems (see, 
for example, [2] [3]). However, the renewal-reward theorem 
requires random events to be independent and identically 
distributed (i.i.d.) over each renewal frame. While this i.i.d. 
assumption may hold if a single control law is implemented 
repeatedly, it is often difficult to choose in advance a single 
control law that optimizes the system subject to the desired 
constraints. This paper investigates the situation where the 
control policies used may differ from frame to frame, and 
are designed to dynamically solve the problem of interest. 

This renewal problem arises in many different applications. 
One application of interest is a task processing network. For 
example, consider a network of wireless devices that repeat- 
edly collaborate to accomplish tasks (such as reporting sensor 
data to a destination, or performing distributed computation on 
data). Tasks are performed one after the other, and for each 
task we must decide what modes of operation and communi- 
cation to use, possibly allowing some nodes of the network 
to remain idle to save power. It is then important to make 
decisions that maximize the time average utility associated 
with task processing, subject to time average power constraints 
at each node. Alternatively, one may want to minimize time
average power, subject to constraints on utility and on the 
"left-over" communication rates available for data that is not 
associated with the task processing. 

This paper develops a general framework for solving such 
problems. To do so, we extend the theory of Lyapunov opti- 
mization from [4]. Specifically, work in [4] considers discrete 
time queueing networks and develops a simple drift-plus- 
penalty rule for making optimal decisions. These decisions are 
made in a greedy manner every slot based only on the observed 
traffic and channel conditions for that slot, without requiring 
a-priori knowledge of the underlying probability distribution. 
However, the work in [4] assumes all slots have fixed length, 
the random network condition is observed at the beginning of 
each slot and does not change over the slot, and this condition 
is not influenced by control actions. The general renewal 
problem treated in the current paper is more complex because 
each frame may have a different length and may contain a 
sequence of random events. The frame length and the random 
event sequence may depend on the control decisions made 
over the course of the frame. Rather than making a single 
decision every slot, every frame we must specify a policy, 
being a contingency plan for making decisions over the course 
of the frame in reaction to the resulting system events. 

This paper solves the general problem with a conceptually 
simple technique that chooses a policy to minimize a drift- 
plus-penalty ratio every frame. We first develop algorithms 
for minimizing the time average of a penalty process subject 
to a collection of time average constraints. We then consider 
maximization of a concave function of a vector of time 
average attributes subject to similar constraints. This utility 
maximization problem is challenging because of the variable 
frame length. We overcome this challenge with a novel trans- 
formation together with a variation of Jensen's inequality. 

While this paper focuses on task processing applications, 
we note that our renewal framework can also handle Markov 
decision problems. Specifically, suppose the system operates 
according to either a continuous or discrete time Markov chain 
with control-dependent transition probabilities. If the chain has 
a recurrent state, then renewals can be defined as re-visitations 
to this state, and the same drift-plus-penalty ratio technique 
can be applied. However, the drift-plus-penalty ratio may be 
difficult to optimize for Markov decision problems with high 
dimension (see also [5]). 

Prior work on learning algorithms for Markov decision 
problems is in [6], and related work in [7] [8] [9] [10] considers 
learning for optimization of energy and delay in queueing 
systems. The works [6]-[10] use stochastic approximation 
theory and two-timescale convergence analysis. The Lagrange
multiplier updates in [6]-[10] are analogous to the virtual
queue updates we use in this paper. However, the Lyapunov
optimization framework we use is different and does not
require a two-timescale approach. It also provides more explicit
bounds on convergence times and deviations from optimality,
and allows a broader class of problems such as task processing
problems.

Fig. 1. A timeline illustrating renewal frames for the system.

The Lyapunov optimization technique that we use in this
paper is based on our previous work in [4] [11] [12] [13] that
develops the drift-plus-penalty method for stochastic network
optimization, including opportunistic scheduling for
throughput-utility maximization [4] [11] [13] and average power
minimization [12] (see also [14]). Alternative "fluid-based" stochastic
optimization techniques for queueing networks are developed 
in [15] [16] [17] [18], and dual and primal-dual algorithms for 
systems without queues, based on tracking a corresponding 
static optimization problem, are in [19] [20] [21]. Our current 
paper considers the more complex renewal problem, and 
leverages ideas in [5] [22], where [5] considers a frame-based 
Lyapunov framework for Markov decision problems involving 
network delay, and [22] develops a ratio rule for utility 
optimization in wireless systems with variable length frames 
and time-correlated channels. 

Recent work in [23] considers a task processing system 
where multiple wireless "reporting nodes" select data formats 
(e.g., "voice" or "video") in which to deliver sensed informa- 
tion. The work [23] also uses a renewal structure. However, it 
assumes a single random event occurs at the beginning of each 
renewal frame, and the event and frame size are not influenced 
by control actions. More general problems can be treated using 
the theory developed in the current paper. 

II. Renewal System Model 

Consider a system that operates over renewal frames.
Specifically, consider the timeline of non-negative real times
$t \geq 0$, and suppose this timeline is segmented into successive
frames of duration $\{T[0], T[1], T[2], \ldots\}$, as shown in Fig. 1.
Define $t[0] \triangleq 0$, and for each positive integer $r$ define $t[r]$ as
the $r$th renewal time:

$t[r] \triangleq \sum_{i=0}^{r-1} T[i]$

The interval of all times $t$ such that $t[r] \leq t < t[r+1]$
is defined as the $r$th renewal frame, defined for each
$r \in \{0, 1, 2, \ldots\}$.
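
As a small illustration of this bookkeeping, the renewal times are running sums of the frame durations, and a time $t$ belongs to the frame whose renewal times bracket it. The following is a minimal sketch only; the helper names `renewal_times` and `frame_index` are hypothetical and are introduced here purely for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def renewal_times(frame_durations):
    """Return [t[0], ..., t[R]] with t[0] = 0 and t[r] = T[0] + ... + T[r-1]."""
    return [0.0] + list(accumulate(float(T) for T in frame_durations))


def frame_index(t, times):
    """Return the frame r such that times[r] <= t < times[r + 1]."""
    if t < times[0] or t >= times[-1]:
        raise ValueError("t lies outside the frames covered by `times`")
    # bisect_right counts the renewal times that are <= t.
    return bisect_right(times, t) - 1


# Three frames of durations 2, 3 and 1.5 time units.
times = renewal_times([2.0, 3.0, 1.5])
```

Note that a renewal time itself belongs to the frame that starts there, which matches the half-open interval used in the definition.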

At the beginning of each renewal frame $r$, the controller
selects a policy $\pi[r]$ from an abstract policy space $\mathcal{P}$, and
implements the policy over the duration of the frame. There
may be random events that arise over the renewal frame
(with distributions that are possibly dependent on the policy),
and the policy specifies a contingency plan for reacting
to these events. The policy incurs a vector of penalties
$y[r] = (y_0[r], y_1[r], \ldots, y_L[r])$ and attributes
$x[r] = (x_1[r], \ldots, x_M[r])$ for some integers $L \geq 0$, $M \geq 0$ (where
$L = 0$ corresponds to problems without $y[r]$ penalties, and
$M = 0$ corresponds to problems without $x[r]$ attributes).
The policy may also affect the renewal frame duration $T[r]$.
Formally, the values $T[r]$, $y_l[r]$, $x_m[r]$ are determined by
random functions $\hat{T}(\cdot)$, $\hat{y}_l(\cdot)$, $\hat{x}_m(\cdot)$ of the policy $\pi[r]$:

$T[r] \triangleq \hat{T}(\pi[r])$   (1)
$y_l[r] \triangleq \hat{y}_l(\pi[r]) \quad \forall l \in \{0, 1, \ldots, L\}$   (2)
$x_m[r] \triangleq \hat{x}_m(\pi[r]) \quad \forall m \in \{1, \ldots, M\}$   (3)

We assume the values of $[\hat{T}(\pi[r]), (\hat{y}_l(\pi[r])), (\hat{x}_m(\pi[r]))]$
for frame $r$ are conditionally independent of events in previous
frames given the particular policy $\pi = \pi[r]$, and are identically
distributed over all frames that use the same policy $\pi$.

Consider now a particular control algorithm that chooses
policies $\pi[r] \in \mathcal{P}$ every frame $r$ according to some well
defined (possibly probabilistic) rule, and define the following
frame-average expectations, defined for integers $R > 0$:

$\overline{T}[R] \triangleq \frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\}$ ,  $\overline{y}_l[R] \triangleq \frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}\{y_l[r]\}$   (4)

where we recall that $T[r]$, $y_l[r]$, $x_m[r]$ depend on the policy
$\pi[r]$ by (1)-(3). Define $\overline{x}_m[R]$ similarly, and define the infinite
horizon frame-average expectations $\overline{T}$, $\overline{y}_l$, $\overline{x}_m$ by:

$(\overline{T}, \overline{y}_l, \overline{x}_m) = \lim_{R \to \infty} (\overline{T}[R], \overline{y}_l[R], \overline{x}_m[R])$

where we temporarily assume the limits are well defined.

A. Optimization Objective 

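The first type of problem we consider uses only penalties
$y[r]$. We must choose a policy $\pi[r] \in \mathcal{P}$ every frame $r$ to
minimize the ratio $\overline{y}_0/\overline{T}$ subject to constraints on $\overline{y}_l/\overline{T}$:

Minimize: $\overline{y}_0/\overline{T}$   (5)
Subject to: $\overline{y}_l/\overline{T} \leq c_l \quad \forall l \in \{1, \ldots, L\}$   (6)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$   (7)

where $c_l$ for $l \in \{1, \ldots, L\}$ are a given collection of
real-valued (possibly negative) constants.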

The motivation for looking at the ratio $\overline{y}_l/\overline{T}$ is that it defines
the time average penalty associated with the $y_l[r]$ process. To
see this, suppose the following limits converge to constants
$y_l^{av}$ and $T^{av}$ with probability 1:

$\lim_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} y_l[r] = y_l^{av}$ ,  $\lim_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} T[r] = T^{av}$   (w.p.1)

Under very mild conditions, the existence of the limits $y_l^{av}$
and $T^{av}$ implies the frame-average expectations also have well
defined limits, with $\overline{y}_l = y_l^{av}$ and $\overline{T} = T^{av}$. This holds,
for example, whenever $y_l[r]$ and $T[r]$ are deterministically
bounded by finite constants, or when more general conditions
hold that allow the Lebesgue dominated convergence theorem
to be applied [24]. Then the time average penalty per unit
time associated with $y_l[r]$ (sampled only at renewal times for
simplicity) satisfies with probability 1:

$\lim_{R\to\infty} \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} = \lim_{R\to\infty} \frac{\frac{1}{R}\sum_{r=0}^{R-1} y_l[r]}{\frac{1}{R}\sum_{r=0}^{R-1} T[r]} = \frac{\overline{y}_l}{\overline{T}}$

Therefore, the value $\overline{y}_l/\overline{T}$ indeed represents the limiting
penalty per unit time associated with the process $y_l[r]$.
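
The distinction between a ratio of averages and an average of per-frame ratios is easy to overlook, so a short numerical sketch may help. The function names and the three-frame sample data below are hypothetical, chosen only to illustrate that the penalty per unit time equals the ratio of frame averages and generally differs from the average of the per-frame ratios $y_l[r]/T[r]$.

```python
def time_average_penalty(y, T):
    """Total penalty divided by total elapsed time over the observed frames."""
    return sum(y) / sum(T)


def ratio_of_frame_averages(y, T):
    """(1/R) sum y[r] divided by (1/R) sum T[r]; the 1/R factors cancel."""
    R = len(T)
    return (sum(y) / R) / (sum(T) / R)


def average_of_frame_ratios(y, T):
    """(1/R) sum of y[r]/T[r]; in general NOT the penalty per unit time."""
    return sum(yr / Tr for yr, Tr in zip(y, T)) / len(T)


# Hypothetical penalties and durations for three frames.
y = [1.0, 4.0, 2.0]
T = [2.0, 3.0, 5.0]
per_unit_time = time_average_penalty(y, T)
same_value = ratio_of_frame_averages(y, T)
different_value = average_of_frame_ratios(y, T)
```

Long frames carry more weight in the time average, which is why the ratio of averages is the right quantity for variable-length frames.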

The problem (5)-(7) seeks only to minimize a time average
subject to time average constraints. The second problem we
consider, more general than the first, seeks to maximize a concave
and entrywise non-decreasing function $\phi(\gamma)$ of the time
average attribute vector ratio $\overline{x}/\overline{T}$, where $\overline{x} = (\overline{x}_1, \ldots, \overline{x}_M)$:

Maximize: $\phi(\overline{x}/\overline{T})$   (8)
Subject to: $\overline{y}_l/\overline{T} \leq c_l \quad \forall l \in \{1, \ldots, L\}$   (9)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$   (10)

where $\phi(\gamma)$ is a given concave and entrywise non-decreasing
utility function defined over $\gamma = (\gamma_1, \ldots, \gamma_M) \in \mathbb{R}^M$.

We note that for some problems, one may be more interested
in optimizing the per-frame average $\overline{y}_0$, rather than the time
average $\overline{y}_0/\overline{T}$, and such problems are treated in Section VI-A.



B. Boundedness Assumptions 

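We assume $x_m[r]$, $T[r]$, and $y_0[r]$ have bounded conditional
expectations, regardless of the policy. That is, there are finite
constants $x_m^{min}$, $x_m^{max}$, $T^{min}$, $T^{max}$, $y_0^{min}$, $y_0^{max}$ such that for
all $\pi[r] \in \mathcal{P}$ and all $m \in \{1, \ldots, M\}$ we have:

$y_0^{min} \leq \mathbb{E}\{\hat{y}_0(\pi[r]) \mid \pi[r]\} \leq y_0^{max}$
$0 < T^{min} \leq \mathbb{E}\{\hat{T}(\pi[r]) \mid \pi[r]\} \leq T^{max}$
$x_m^{min} \leq \mathbb{E}\{\hat{x}_m(\pi[r]) \mid \pi[r]\} \leq x_m^{max}$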



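Define $\gamma_m^{min}$ and $\gamma_m^{max}$ by:

$\gamma_m^{min} \triangleq \min\left[\frac{x_m^{min}}{T^{min}}, \frac{x_m^{min}}{T^{max}}\right]$ ,  $\gamma_m^{max} \triangleq \max\left[\frac{x_m^{max}}{T^{min}}, \frac{x_m^{max}}{T^{max}}\right]$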



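Define the hyper-rectangle $\mathcal{R}$ by:

$\mathcal{R} \triangleq \{\gamma \in \mathbb{R}^M \mid \gamma_m^{min} \leq \gamma_m \leq \gamma_m^{max} \quad \forall m \in \{1, \ldots, M\}\}$   (11)

Then for any algorithm that chooses policies $\pi[r] \in \mathcal{P}$ for all
frames $r$, it is not difficult to show that $\overline{x}[R]/\overline{T}[R] \in \mathcal{R}$
for all $R \in \{1, 2, 3, \ldots\}$, where $\overline{x}[R] = (\overline{x}_1[R], \ldots, \overline{x}_M[R])$ and
$\overline{T}[R]$ are frame-average expectations over the first $R$ frames, as
defined by (4).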

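Finally, we assume the conditional second moments of $T[r]$,
$x_m[r]$, and $y_l[r]$ (for $l \neq 0$) are finite, regardless of the policy.
That is, there is a finite constant $\sigma^2$ such that for all $\pi[r] \in \mathcal{P}$:

$\mathbb{E}\{\hat{T}(\pi[r])^2 \mid \pi[r]\} \leq \sigma^2$
$\mathbb{E}\{\hat{y}_l(\pi[r])^2 \mid \pi[r]\} \leq \sigma^2 \quad \forall l \in \{1, \ldots, L\}$
$\mathbb{E}\{\hat{x}_m(\pi[r])^2 \mid \pi[r]\} \leq \sigma^2 \quad \forall m \in \{1, \ldots, M\}$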



C. Optimality of i.i.d. Algorithms 

We now state the problem (5)-(7) more precisely, using
lim sups which do not require existence of a well defined limit:

Minimize: $\limsup_{R\to\infty} \overline{y}_0[R]/\overline{T}[R]$   (12)
Subject to: $\limsup_{R\to\infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l \quad \forall l \in \{1, \ldots, L\}$   (13)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$   (14)

Assume that the constraints (13)-(14) are feasible, and define
$ratio^{opt}$ as the infimum ratio in (12) over all algorithms that
satisfy these constraints.

Define an i.i.d. algorithm as one that, at the beginning of
each new frame $r \in \{0, 1, 2, \ldots\}$, chooses a policy $\pi[r]$ by
independently and probabilistically selecting $\pi \in \mathcal{P}$ according
to some distribution that is the same for all frames $r$. Let $\pi^*[r]$
represent such an i.i.d. algorithm. Then the random variables
$\{\hat{T}(\pi^*[r])\}_{r=0}^{\infty}$ are independent and identically distributed
(i.i.d.) over frames, as are $\{\hat{y}_l(\pi^*[r])\}_{r=0}^{\infty}$. Thus, by the law
of large numbers, these have well defined time averages $\overline{T}^*$
and $\overline{y}_l^*$ with probability 1, where the averages are equal to the
expectations over one frame.

Lemma 1: (Optimality over i.i.d. algorithms) If the constraints
(13)-(14) are feasible, then for any $\delta > 0$, there exists
an i.i.d. algorithm $\pi^*[r]$ that satisfies:

$\mathbb{E}\{\hat{y}_0(\pi^*[r])\} \leq \mathbb{E}\{\hat{T}(\pi^*[r])\}(ratio^{opt} + \delta)$   (15)
$\mathbb{E}\{\hat{y}_l(\pi^*[r])\} \leq \mathbb{E}\{\hat{T}(\pi^*[r])\}(c_l + \delta) \quad \forall l \in \{1, \ldots, L\}$   (16)

Proof: See Appendix A. □



III. Optimizing Time Averages 

Here we develop an algorithm to treat the problem (5)-(7).
To treat the constraints $\overline{y}_l/\overline{T} \leq c_l$, which are equivalent to
the constraints $\overline{y}_l \leq c_l\overline{T}$, we define virtual queues $Z_l[r]$ for
$l \in \{1, \ldots, L\}$, with finite initial condition and with update
equation:

$Z_l[r+1] = \max[Z_l[r] + y_l[r] - c_l T[r], 0] \quad \forall l \in \{1, \ldots, L\}$   (17)

The intuition is that if we can stabilize the queue $Z_l[r]$, then
the time average of the "service process" $c_l T[r]$ is greater than
or equal to the time average of the "arrival process" $y_l[r]$ (see
also [12] for application to virtual power queues for meeting
time average power constraints).
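
The update (17) is a one-line computation per constraint. The following minimal sketch applies it to all $L$ queues at the end of a frame; the function name `update_virtual_queues` and the list-based data layout are assumptions made here for illustration.

```python
def update_virtual_queues(Z, y, T, c):
    """Apply Z_l[r+1] = max(Z_l[r] + y_l[r] - c_l * T[r], 0) for l = 1..L.

    Z: current queue values Z_1..Z_L
    y: penalties y_1..y_L observed over the frame (y_0 is not included)
    T: duration of the frame
    c: constraint constants c_1..c_L
    """
    return [max(Zl + yl - cl * T, 0.0) for Zl, yl, cl in zip(Z, y, c)]


# Hypothetical frame of duration 2 with two constraints.
Z_next = update_virtual_queues(Z=[0.0, 2.0], y=[1.0, 0.5], T=2.0, c=[1.0, 1.0])
```

A queue grows on frames where the penalty exceeds its allowance $c_l T[r]$ and shrinks otherwise, so a large $Z_l[r]$ records accumulated constraint violation.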

Let $Z[r] = (Z_1[r], \ldots, Z_L[r])$ be the vector of virtual
queues, and define the following quadratic Lyapunov function
$L(Z[r])$:

$L(Z[r]) \triangleq \frac{1}{2}\sum_{l=1}^{L} Z_l[r]^2$

The value $L(Z[r])$ is a scalar measure of the size of the
queue backlogs. The intuition is that if we can take actions
that consistently push this value down, then queues can be
stabilized. Define the frame-based conditional Lyapunov drift
$\Delta(Z[r])$ by:

$\Delta(Z[r]) \triangleq \mathbb{E}\{L(Z[r+1]) - L(Z[r]) \mid Z[r]\}$



EXTENDED VERSION 






Lemma 2: Under any control decision for choosing $\pi[r] \in \mathcal{P}$,
we have for all $r$ and all possible $Z[r]$:

$\Delta(Z[r]) \leq B + \mathbb{E}\left\{\sum_{l=1}^{L} Z_l[r]\,[y_l[r] - c_l T[r]] \mid Z[r]\right\}$   (18)

where $B$ is a constant that satisfies for all $r$ and all possible
$Z[r]$:

$B \geq \frac{1}{2}\sum_{l=1}^{L} \mathbb{E}\{(y_l[r] - c_l T[r])^2 \mid Z[r]\}$   (19)

Such a constant $B$ exists by the boundedness assumptions in
Section II-B.

Proof: Squaring (17) yields:

$Z_l[r+1]^2 \leq (Z_l[r] + y_l[r] - c_l T[r])^2$
$= Z_l[r]^2 + (y_l[r] - c_l T[r])^2 + 2 Z_l[r](y_l[r] - c_l T[r])$

Taking conditional expectations, dividing by 2, and summing
over $l \in \{1, \ldots, L\}$ yields the result. □

A. The Drift-Plus-Penalty Ratio Algorithm 

Our Drift-Plus-Penalty Ratio Algorithm is designed to minimize
a sum of the variables on the right-hand side of the drift
bound (18) and a penalty term, divided by an expected frame
size, as in [22]. The penalty term uses a non-negative constant
$V$ that will be shown to affect a performance tradeoff:

• (Policy Selection) Every frame $r \in \{0, 1, 2, \ldots\}$, observe
the virtual queues $Z[r]$ and choose a policy $\pi[r] \in \mathcal{P}$ to
minimize the following expression:

$\frac{\mathbb{E}\{V\hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi[r]) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}}$   (20)

• (Queue Update) Observe the resulting $y[r]$ and $T[r]$
values, and update virtual queues $Z_l[r]$ by (17).
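
The two steps above can be sketched on a toy instance. The sketch below is illustrative only and makes strong simplifying assumptions that are not part of the general framework: the policy space is a small finite list, the expectations in the ratio are known exactly, and the realized frame outcomes equal those expectations (deterministic frames). The names `select_policy`, `run`, and the "idle"/"work" policies are hypothetical.

```python
def select_policy(Z, V, policies):
    """Pick the policy minimizing (V*y0 + sum_l Z_l*y_l) / T, as in the ratio rule."""
    def ratio(p):
        numerator = V * p["y0"] + sum(Zl * yl for Zl, yl in zip(Z, p["y"]))
        return numerator / p["T"]
    return min(policies, key=ratio)


def run(policies, c, V, num_frames):
    """Run the ratio rule and return (sum y0 / sum T, [sum y_l / sum T])."""
    Z = [0.0] * len(c)
    total_T = 0.0
    total_y0 = 0.0
    total_y = [0.0] * len(c)
    for _ in range(num_frames):
        p = select_policy(Z, V, policies)
        T, y0, y = p["T"], p["y0"], p["y"]  # assumption: outcomes are deterministic
        # Virtual queue update: Z_l <- max(Z_l + y_l - c_l * T, 0).
        Z = [max(Zl + yl - cl * T, 0.0) for Zl, yl, cl in zip(Z, y, c)]
        total_T += T
        total_y0 += y0
        total_y = [s + yl for s, yl in zip(total_y, y)]
    return total_y0 / total_T, [s / total_T for s in total_y]


# Toy instance: y0 is energy, y1 is minus the work done, and the constraint
# y1/T <= -0.5 requires a work rate of at least 0.5 per unit time.
policies = [
    {"name": "idle", "T": 1.0, "y0": 0.0, "y": [0.0]},
    {"name": "work", "T": 2.0, "y0": 4.0, "y": [-2.0]},
]
power, rates = run(policies, c=[-0.5], V=10.0, num_frames=3000)
```

For this instance the best stationary mix works on one third of the frames, giving an optimal power of 1 per unit time. The sketch idles until the virtual queue exceeds $2V$ and then settles into that mix, so the finite-horizon averages approach the optimum and the constraint as the number of frames grows.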

Details on minimizing (20) are given in Section V. Rather
than assuming we achieve the exact infimum of (20) over all
policies $\pi[r] \in \mathcal{P}$, it is useful to allow our decisions to come
within an additive constant $C$ of the infimum.

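Definition 1: A policy $\pi[r]$ is a C-additive approximation
for the problem (20) if for a given constant $C \geq 0$ we have:

$\frac{\mathbb{E}\{V\hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi[r]) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}} \leq C + \inf_{\pi \in \mathcal{P}}\left[\frac{\mathbb{E}\{V\hat{y}_0(\pi) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi) \mid Z[r]\}}\right]$
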
In Section V-B it is shown that the infimum of (20) over
$\pi \in \mathcal{P}$ is the same as the infimum over the extended class
of probabilistically mixed strategies that choose a random
$\pi \in \mathcal{P}$ according to some distribution (exactly what i.i.d.
policies do every frame). Thus, if policy $\pi[r]$ is a C-additive
approximation, then:

$\frac{\mathbb{E}\{V\hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi[r]) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}} \leq C + \frac{\mathbb{E}\{V\hat{y}_0(\pi^*[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}}$   (21)

where $\pi^*[r]$ is any i.i.d. algorithm. Note that conditional
expectations given $Z[r]$ are the same as unconditional
expectations under i.i.d. algorithms, because their decisions are
independent of system history.

Theorem 1: (Algorithm Performance) Assume the constraints
of problem (12)-(14) are feasible. Fix constants $C \geq 0$,
$V > 0$, and assume the above algorithm is implemented using
any C-additive approximation every frame $r$ for the minimization
in (20). Assume initial conditions satisfy $\mathbb{E}\{L(Z[0])\} < \infty$.
Then:

a) For all $l \in \{1, \ldots, L\}$ we have:

$\limsup_{R\to\infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l$   (22)

$\limsup_{R\to\infty} \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l$   (w.p.1)

where "w.p.1" stands for "with probability 1."

b) For all integers $R > 0$ we have:

$\frac{\overline{y}_0[R]}{\overline{T}[R]} \leq ratio^{opt} + \frac{B/\overline{T}[R] + C}{V} + \frac{\mathbb{E}\{L(Z[0])\}}{V R\, \overline{T}[R]}$   (23)

and hence:

$\limsup_{R\to\infty} \overline{y}_0[R]/\overline{T}[R] \leq ratio^{opt} + (B/T^{min} + C)/V$   (24)

where $B$ is defined in (19), and $ratio^{opt}$ is the optimal solution
to (12)-(14).

Thus, the algorithm satisfies all constraints, and the value
of $V$ can be chosen appropriately large to make $(B/T^{min} + C)/V$
arbitrarily small, ensuring that the time average penalty
is arbitrarily close to its optimal value $ratio^{opt}$. The tradeoff
in choosing a large value of $V$ comes in the size of the $Z_l[r]$
queues and the number of frames required for $\mathbb{E}\{Z_l[R]\}/R$
to approach zero (which affects convergence time of the
algorithm, see (34) in the proof). In particular, in Appendix
B it is shown that there are constants $F_1$, $F_2$ such that for all
$l \in \{1, \ldots, L\}$ we have:

$\frac{\overline{y}_l[R]}{\overline{T}[R]} \leq c_l + \frac{1}{T^{min}}\sqrt{\frac{F_1 + V F_2}{R} + \frac{\sum_{k=1}^{L} \mathbb{E}\{Z_k[0]^2\}}{R^2}}$   (25)

It is clear that the second term on the right-hand side above
vanishes as $R \to \infty$, but the number of frames required for
it to be small depends on the $V$ parameter. This bound holds
for general problems. A tighter bound can be obtained for
problems with special structure (see Appendix G).

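Proof: (Theorem 1) Consider any frame $r \in \{0, 1, 2, \ldots\}$.
From (18) we have:

$\Delta(Z[r]) + V\mathbb{E}\{\hat{y}_0(\pi[r]) \mid Z[r]\} \leq B + \mathbb{E}\left\{V\hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r]\,[\hat{y}_l(\pi[r]) - c_l\hat{T}(\pi[r])] \mid Z[r]\right\}$

Substituting (21) into the above yields:

$\Delta(Z[r]) + V\mathbb{E}\{\hat{y}_0(\pi[r]) \mid Z[r]\} \leq B + \mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}\left[C + \frac{\mathbb{E}\{V\hat{y}_0(\pi^*[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}}\right] - \sum_{l=1}^{L} Z_l[r]\, c_l\, \mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}$   (26)
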
In the above inequality, $\pi[r]$ represents the C-additive approximate
decision actually made, and $\pi^*[r]$ is from any alternative
i.i.d. algorithm. Fixing any $\delta > 0$, plugging the i.i.d. algorithm
$\pi^*[r]$ from (15)-(16) into the right-hand side of (26), and
letting $\delta \to 0$ yields:

$\Delta(Z[r]) + V\mathbb{E}\{\hat{y}_0(\pi[r]) \mid Z[r]\} \leq B + \mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}\,[C + V ratio^{opt}]$   (27)

Taking expectations of the above yields:

$\mathbb{E}\{L(Z[r+1])\} - \mathbb{E}\{L(Z[r])\} + V\mathbb{E}\{\hat{y}_0(\pi[r])\} \leq B + \mathbb{E}\{\hat{T}(\pi[r])\}\,[C + V ratio^{opt}]$   (28)

Summing the above over $r \in \{0, \ldots, R-1\}$ for some integer
$R > 0$ and dividing by $R$ yields:

$\frac{\mathbb{E}\{L(Z[R])\} - \mathbb{E}\{L(Z[0])\}}{R} + V\overline{y}_0[R] \leq B + \overline{T}[R]\,[C + V ratio^{opt}]$   (29)

Rearranging terms in the above and using the fact that
$\mathbb{E}\{L(Z[R])\} \geq 0$ yields the result of part (b).

To prove part (a), from (27) there is a constant $F$ such that:

$\Delta(Z[r]) \leq F$   (30)

Thus, the drift of a quadratic Lyapunov function is bounded
by a constant. Further, the second moments of per-frame
changes in $Z_l[r]$ are bounded because of the second moment
assumptions on $y_l[r]$ and $T[r]$. It follows that (see [25]):

$\lim_{R\to\infty} \mathbb{E}\{Z_l[R]\}/R = 0$   (31)

$\lim_{R\to\infty} Z_l[R]/R = 0$   (w.p.1)   (32)

Now from the queue update (17) we have for any frame $r$:

$Z_l[r+1] \geq Z_l[r] + y_l[r] - c_l T[r]$

Summing the above over $r \in \{0, \ldots, R-1\}$ for some integer
$R > 0$ yields:

$Z_l[R] - Z_l[0] \geq \sum_{r=0}^{R-1} [y_l[r] - c_l T[r]]$   (33)

Taking expectations, dividing by $R$, and using $\mathbb{E}\{Z_l[0]\} \geq 0$
yields for all integers $R > 0$:

$\frac{\mathbb{E}\{Z_l[R]\}}{R} \geq \overline{y}_l[R] - c_l\overline{T}[R]$

Thus:

$\frac{\overline{y}_l[R]}{\overline{T}[R]} \leq c_l + \frac{\mathbb{E}\{Z_l[R]\}}{R\,\overline{T}[R]}$   (34)

Taking limits of the above and using (31) proves (22). A
similar argument uses (32) to prove the probability 1 bound in
part (a). □

Under a mild "Slater-type" assumption that ensures the
constraints (13) are achievable with "$\epsilon$-slackness," the queues
$Z_l[R]$ can be shown to be strongly stable, in the sense that
the time average expectation is bounded by $O(V)$. If further
mild fourth moment boundedness assumptions hold for $y_l[r]$
and $T[r]$ then the same bound (24) can be shown to hold for
pure time averages with probability 1 (see [25] and Appendix
H).



IV. Utility Optimization 

Consider now the problem (8)-(10), which seeks to maximize
$\phi(\overline{x}/\overline{T})$ subject to $\overline{y}_l/\overline{T} \leq c_l$ for all $l \in \{1, \ldots, L\}$.
We transform this problem of maximizing a function of a
time average ratio into a problem of the type (5)-(7). The
following variation on Jensen's inequality is crucial in this
transformation:

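Lemma 3: (Variation on Jensen's Inequality) Let $\phi(\gamma)$ be
any continuous and concave function defined over $\gamma \in \mathcal{R}$ for
some closed and bounded hyper-rectangle $\mathcal{R}$.

a) Let $(T, \gamma)$ be any random vector (with an arbitrary joint
distribution) that satisfies $T > 0$ and $\gamma \in \mathcal{R}$ with probability
1. Assume that $0 < \mathbb{E}\{T\} < \infty$. Then:

$\frac{\mathbb{E}\{T\phi(\gamma)\}}{\mathbb{E}\{T\}} \leq \phi\left(\frac{\mathbb{E}\{T\gamma\}}{\mathbb{E}\{T\}}\right)$

b) Let $(T[r], \gamma[r])$ be a sequence of arbitrarily correlated
random vectors for $r \in \{0, 1, 2, \ldots\}$. Assume that $T[r] > 0$,
$\gamma[r] \in \mathcal{R}$ for all $r$ (with probability 1), and:

$0 < T^{min} \leq \mathbb{E}\{T[r]\} \leq T^{max} < \infty \quad \forall r \in \{0, 1, 2, \ldots\}$

Then for any $R > 0$:

$\frac{\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\phi(\gamma[r])\}}{\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\}} \leq \phi\left(\frac{\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\gamma[r]\}}{\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\}}\right)$

Furthermore, assuming that the limits $\overline{T\phi(\gamma)}$ and $\overline{T\gamma}$ defined
below exist, we have:

$\overline{T\phi(\gamma)}/\overline{T} \leq \phi(\overline{T\gamma}/\overline{T})$   (35)

where:

$\overline{T\phi(\gamma)} \triangleq \lim_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\phi(\gamma[r])\}$

$\overline{T\gamma} \triangleq \lim_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} \mathbb{E}\{T[r]\gamma[r]\}$

Proof: Part (b) follows immediately from part (a) by
defining the random vector $(T, \gamma)$ to be $(T[J], \gamma[J])$, where
$J$ is a uniformly distributed integer in $\{0, 1, \ldots, R-1\}$ that
is independent of the $(T[r], \gamma[r])$ process. Part (a) is proven
in Appendix E. □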
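
A quick numerical check can make part (a) concrete. The sketch below evaluates both sides of the inequality for a scalar $\gamma$ and a finite joint distribution; the helper name `jensen_ratio_sides`, the two-point distribution, and the choice of a square-root utility are all assumptions made here for illustration.

```python
import math


def jensen_ratio_sides(outcomes, phi):
    """Return (E{T*phi(gamma)}/E{T}, phi(E{T*gamma}/E{T})).

    outcomes: list of (probability, T, gamma) triples with T > 0, scalar gamma.
    """
    ET = sum(p * T for p, T, g in outcomes)
    ET_phi = sum(p * T * phi(g) for p, T, g in outcomes)
    ET_gamma = sum(p * T * g for p, T, g in outcomes)
    return ET_phi / ET, phi(ET_gamma / ET)


# Two equally likely outcomes in which T and gamma are correlated.
lhs, rhs = jensen_ratio_sides([(0.5, 1.0, 0.0), (0.5, 3.0, 4.0)], math.sqrt)
```

Here the left side is 1.5 and the right side is the square root of 3, so the inequality holds with room to spare. The weighting by $T$ is what distinguishes this from the standard form of Jensen's inequality.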

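Now define an auxiliary vector $\gamma[r] = (\gamma_1[r], \ldots, \gamma_M[r])$,
to be chosen in the set $\mathcal{R}$ defined in (11) on every frame $r$.

Lemma 4: (Equivalent Transformation) The problem (8)-(10)
is equivalent to the following transformed problem:

Maximize: $\overline{T\phi(\gamma)}/\overline{T}$   (36)
Subject to: $\overline{x}_m \geq \overline{T\gamma_m} \quad \forall m \in \{1, \ldots, M\}$   (37)
$\overline{y}_l/\overline{T} \leq c_l \quad \forall l \in \{1, \ldots, L\}$   (38)
$\gamma[r] \in \mathcal{R} \quad \forall r \in \{0, 1, 2, \ldots\}$   (39)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$   (40)
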
Proof: We briefly sketch the proof: Let $\pi^*[r]$, $\gamma^*[r]$ be
a policy that optimally solves the above transformed problem,
and assume for simplicity it yields well defined time
averages $\overline{T}^*$, $\overline{x}_m^*$, $\overline{T^*\phi(\gamma^*)}$, $\overline{T^*\gamma^*}$, and optimal utility






$util^* = \overline{T^*\phi(\gamma^*)}/\overline{T}^*$. Then the policy $\pi^*[r]$ also satisfies all
constraints of problem (8)-(10), and yields:

$\phi(\overline{x}^*/\overline{T}^*) \geq \phi(\overline{T^*\gamma^*}/\overline{T}^*) \geq \overline{T^*\phi(\gamma^*)}/\overline{T}^* = util^*$

where the first inequality above holds by (37) and the
entrywise non-decreasing property of $\phi(\gamma)$, and the second holds
by (35). Thus, the optimal utility of problem (8)-(10) is greater
than or equal to that of the transformed problem. A similar
argument shows it is also less than or equal to the optimal
utility of the transformed problem. □

The transformed problem (36)-(40) has the structure of the
problem (5)-(7) if we define $y_0[r] = -T[r]\phi(\gamma[r])$, write the
constraints (37) as $\overline{T\gamma_m} - \overline{x}_m \leq 0$, and define policy decision
$\pi'[r] = (\pi[r], \gamma[r]) \in \mathcal{P} \times \mathcal{R}$. The resulting algorithm is thus
the same as that given in Section III-A, and for this context
it is given as follows: For the constraints (38), use the same
virtual queues $Z_l[r]$ defined in (17). For the constraints (37),
define virtual queues $G_m[r]$ for $m \in \{1, \ldots, M\}$ by:

$G_m[r+1] = \max[G_m[r] + T[r]\gamma_m[r] - x_m[r], 0]$   (41)

Define $G[r] = (G_1[r], \ldots, G_M[r])$. The drift-plus-penalty ratio
to minimize every frame $r$ is then:

$\frac{\mathbb{E}\{-V\hat{T}(\pi[r])\phi(\gamma[r]) + \sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi[r]) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}} + \frac{\mathbb{E}\{\sum_{m=1}^{M} G_m[r]\,[\hat{T}(\pi[r])\gamma_m[r] - \hat{x}_m(\pi[r])] \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}}$



It is easy to see that the above can be minimized by separately
choosing $\gamma[r] \in \mathcal{R}$ and $\pi[r] \in \mathcal{P}$ to minimize their respective
terms, and that $\hat{T}(\pi[r])$ cancels out of the auxiliary variable
decisions. The resulting algorithm is thus to observe $Z[r]$ and
$G[r]$ every frame $r \in \{0, 1, 2, \ldots\}$ and perform the following:

• (Auxiliary Variables) Choose $\gamma[r] \in \mathcal{R}$ to maximize:

$V\phi(\gamma[r]) - \sum_{m=1}^{M} G_m[r]\gamma_m[r]$

• (Policy Selection) Choose $\pi[r] \in \mathcal{P}$ to minimize:

$\frac{\mathbb{E}\{\sum_{l=1}^{L} Z_l[r]\hat{y}_l(\pi[r]) - \sum_{m=1}^{M} G_m[r]\hat{x}_m(\pi[r]) \mid Z[r]\}}{\mathbb{E}\{\hat{T}(\pi[r]) \mid Z[r]\}}$

• (Virtual Queue Update) Update $Z[r]$ by (17) and $G[r]$
by (41).

The auxiliary variable update is a simple deterministic
maximization of a concave function over a hyper-rectangle, and can
be separated into $M$ optimizations of single-variable concave
functions over an interval if the utility function has the form
$\phi(\gamma) = \sum_{m=1}^{M} \phi_m(\gamma_m)$. The policy selection step is again
an optimization of a ratio of expectations and can be done as
described in Section V.
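
For a separable utility the auxiliary variable step has a closed form in many cases. The sketch below assumes, purely for illustration, the logarithmic utility $\phi_m(\gamma_m) = \log(1 + \gamma_m)$ with interval lower bounds of at least zero; neither this utility nor the helper name `choose_auxiliary` comes from the development above. Each coordinate maximizes $V\log(1 + \gamma_m) - G_m\gamma_m$, whose stationary point $V/G_m - 1$ is clipped to the interval.

```python
def choose_auxiliary(G, V, gamma_min, gamma_max):
    """Maximize V*log(1 + g) - G_m*g over [gamma_min_m, gamma_max_m] for each m.

    Assumes gamma_min_m >= 0 so that log(1 + g) is well defined on the interval.
    """
    gammas = []
    for Gm, lo, hi in zip(G, gamma_min, gamma_max):
        if Gm <= 0.0:
            g = hi  # with no queue pressure the objective is increasing in g
        else:
            g = V / Gm - 1.0  # stationary point of the concave objective
        gammas.append(min(max(g, lo), hi))  # clip to the interval
    return gammas


# Hypothetical queue values for M = 3 attributes, each with interval [0, 3].
gamma = choose_auxiliary(G=[4.0, 0.0, 100.0], V=10.0,
                         gamma_min=[0.0, 0.0, 0.0], gamma_max=[3.0, 3.0, 3.0])
```

Clipping is valid because each single-variable objective is concave, so its maximizer over an interval is the stationary point projected onto that interval.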

We define a C-additive approximation of the above algorithm
as one that, every frame $r$, chooses $\pi[r] \in \mathcal{P}$ to yield a
ratio of expectations in the policy selection step that is within
$C$ of the infimum. To explicitly describe the performance
of the above algorithm, we write the problem (8)-(10) more
precisely using limsups:

Maximize: $\limsup_{R\to\infty} \phi(\overline{x}[R]/\overline{T}[R])$   (42)
Subject to: $\limsup_{R\to\infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l \quad \forall l \in \{1, \ldots, L\}$   (43)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$   (44)

Assuming the constraints of the above problem are feasible,
define $util^{opt}$ as the supremum value of (42) over all
algorithms that satisfy (43)-(44).

Theorem 2: Suppose the constraints of problem (42)-(44)
are feasible, and a C-additive approximation is used every
frame (for $C \geq 0$). Then:

a) For all $l \in \{1, \ldots, L\}$ we have:

$\limsup_{R\to\infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l$

$\limsup_{R\to\infty} \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l$   (w.p.1)

b) The achieved utility satisfies:

$\liminf_{R\to\infty} \phi(\overline{x}[R]/\overline{T}[R]) \geq util^{opt} - \frac{D/T^{min} + C}{V}$

where $D$ is a constant that satisfies for all $r$ and all possible
$Z[r]$:

$D \geq \frac{1}{2}\sum_{l=1}^{L} \mathbb{E}\{[y_l[r] - c_l T[r]]^2 \mid Z[r], G[r]\} + \frac{1}{2}\sum_{m=1}^{M} \mathbb{E}\{[T[r]\gamma_m[r] - x_m[r]]^2 \mid Z[r], G[r]\}$   (45)

Such a constant $D$ exists by the boundedness assumptions in
Section II-B.

Proof: See Appendix F. □

V. Optimizing the Ratio of Expectations 

Here we show how to minimize the ratio of expectations
given in (20) (and also in the policy selection stage of
the previous section). These problems can be written more
generally as choosing a policy $\pi[r] \in \mathcal{P}$ to minimize the ratio:

$\frac{\mathbb{E}\{a(\pi)\}}{\mathbb{E}\{b(\pi)\}}$

where $a(\pi)$, $b(\pi)$ are random functions of $\pi \in \mathcal{P}$. The function
$b(\pi)$ is equal to $\hat{T}(\pi)$, and is strictly positive and satisfies the
following for all $\pi \in \mathcal{P}$:

$0 < T^{min} \leq \mathbb{E}\{b(\pi) \mid \pi\} \leq T^{max} < \infty$

The function $a(\pi)$ depends on $Z[r]$, and the above expectations
are implicitly conditioned on $Z[r]$, although we suppress
this notation for simplicity. Define $\theta^*$ as the optimal ratio:

$\theta^* \triangleq \inf_{\pi \in \mathcal{P}} \left[\frac{\mathbb{E}\{a(\pi)\}}{\mathbb{E}\{b(\pi)\}}\right]$

If the expectation $\mathbb{E}\{b(\pi)\}$ is the same for all $\pi \in \mathcal{P}$ (such
as when the frame size is independent of the policy), then
$\theta^*$ is obtained by infimizing the numerator $\mathbb{E}\{a(\pi)\}$. This is
typically easier (often involving learning for stochastic shortest
path computations [26] [5]). Otherwise, the following simple
lemma is useful.

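Lemma 5: We have:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} = 0$   (46)

Further, for any real number $\theta$, we have:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} < 0$ if $\theta > \theta^*$   (47)
$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} > 0$ if $\theta < \theta^*$   (48)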

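Proof: We first assume the result of (46) and use it to prove
(47)-(48). Suppose that $\theta > \theta^*$. We then have for any $\pi \in \mathcal{P}$:

$\mathbb{E}\{a(\pi) - \theta b(\pi)\} = \mathbb{E}\{a(\pi) - \theta^* b(\pi) - (\theta - \theta^*) b(\pi)\}$
$\leq \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} - (\theta - \theta^*) T^{min}$

Thus:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} \leq \inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} - (\theta - \theta^*) T^{min}$
$= 0 - (\theta - \theta^*) T^{min} < 0$

where the equality holds by (46). This proves (47).
Now suppose that $\theta < \theta^*$. Then for any $\pi \in \mathcal{P}$:

$\mathbb{E}\{a(\pi) - \theta b(\pi)\} = \mathbb{E}\{a(\pi) - \theta^* b(\pi) + (\theta^* - \theta) b(\pi)\}$
$\geq \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} + (\theta^* - \theta) T^{min}$

Taking infimums of both sides, again using (46), proves:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} \geq 0 + (\theta^* - \theta) T^{min} > 0$

This proves (48).

It remains only to prove (46). We have for any policy $\pi \in \mathcal{P}$:

$\frac{\mathbb{E}\{a(\pi)\}}{\mathbb{E}\{b(\pi)\}} \geq \inf_{\pi \in \mathcal{P}} \left[\frac{\mathbb{E}\{a(\pi)\}}{\mathbb{E}\{b(\pi)\}}\right] = \theta^*$

Therefore, because $\mathbb{E}\{b(\pi)\} > 0$, we have for all $\pi \in \mathcal{P}$:

$\mathbb{E}\{a(\pi)\} - \theta^* \mathbb{E}\{b(\pi)\} \geq 0$

and hence:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} \geq 0$

It remains only to prove the reverse inequality. Fix $\delta > 0$. By
definition of $\theta^*$ as the infimum ratio, there is a policy $\pi^* \in \mathcal{P}$
that satisfies:

$\frac{\mathbb{E}\{a(\pi^*)\}}{\mathbb{E}\{b(\pi^*)\}} \leq \theta^* + \delta$

Thus:

$\mathbb{E}\{a(\pi^*)\} \leq \theta^* \mathbb{E}\{b(\pi^*)\} + \delta \mathbb{E}\{b(\pi^*)\}$

and so:

$\mathbb{E}\{a(\pi^*)\} - \theta^* \mathbb{E}\{b(\pi^*)\} \leq \delta T^{max}$

Because $\pi^*$ is just a particular algorithm in $\mathcal{P}$, it follows that:

$\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} \leq \delta T^{max}$

This holds for all $\delta > 0$. Taking a limit as $\delta \to 0$ completes
the proof. □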



A. The Bisection Algorithm 

Lemma 5 immediately leads to the following simple bisection
algorithm: Suppose we have upper and lower bounds
$\theta_{max}$ and $\theta_{min}$ so that we know $\theta_{min} \leq \theta^* \leq \theta_{max}$. Then
we can define $\theta = (\theta_{min} + \theta_{max})/2$, and compute the value
of $\inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\}$. If the result is 0, then $\theta = \theta^*$.
If positive, then $\theta < \theta^*$, and otherwise $\theta > \theta^*$. We can then
refine our upper and lower bounds. This leads to a simple
iterative algorithm where the distance between the upper and
lower bounds decreases by a factor of 2 on each iteration. It
thus approaches the optimal $\theta^*$ value exponentially fast. Each
step of the iteration involves minimizing an expectation, rather
than a ratio of expectations.
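
The bisection routine can be sketched as follows for the simplest possible setting. The sketch assumes a finite list of policies whose expectations $\mathbb{E}\{a(\pi)\}$ and $\mathbb{E}\{b(\pi)\}$ are already known, so the inner infimum is a plain minimum; the function name `bisect_ratio`, the tolerance parameter, and the sample values are hypothetical.

```python
def bisect_ratio(policies, theta_min, theta_max, tol=1e-9):
    """Approximate theta* = min over policies of E_a / E_b by bisection.

    policies: list of (E_a, E_b) pairs with E_b > 0.
    theta_min, theta_max: initial bounds with theta_min <= theta* <= theta_max.
    """
    while theta_max - theta_min > tol:
        theta = 0.5 * (theta_min + theta_max)
        # Inner step: minimize an expectation rather than a ratio of expectations.
        value = min(a - theta * b for a, b in policies)
        if value > 0:
            theta_min = theta  # positive value means theta < theta*
        else:
            theta_max = theta  # zero or negative value means theta >= theta*
    return 0.5 * (theta_min + theta_max)


# Hypothetical policies with ratios 1.0, 1.5 and 0.5.
theta_star = bisect_ratio([(1.0, 1.0), (3.0, 2.0), (2.0, 4.0)], -10.0, 10.0)
```

With a finite list the ratio could of course be minimized directly. The point of the sketch is only the sign test, which carries over to settings where the inner minimization is the tractable step.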



B. Optimizing over Pure Policies 

Note that for any set of policies $\mathcal{S}$, Lemma 5 implies
that $\inf_{\pi \in \mathcal{S}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} = 0$ if and only if
$\theta = \inf_{\pi \in \mathcal{S}} \mathbb{E}\{a(\pi)\}/\mathbb{E}\{b(\pi)\}$. Now suppose we have a set of
policies $\mathcal{P}^{pure}$ that we call pure policies, and that the policy
space $\mathcal{P}$ consists of all pure policies as well as all "mixtures"
(or convex combinations) of pure policies, being policies that
choose a pure policy in $\mathcal{P}^{pure}$ with some particular probability
distribution. More generally, define $\Omega$ as the set of all vectors
$(\mathbb{E}\{a(\pi)\}, \mathbb{E}\{b(\pi)\})$ achievable over $\pi \in \mathcal{P}^{pure}$, and suppose
the set of all $(\mathbb{E}\{a(\pi)\}, \mathbb{E}\{b(\pi)\})$ achievable over $\pi \in \mathcal{P}$ is
equal to the convex hull of $\Omega$. Recall that $\theta^*$ is the infimum
ratio of $\mathbb{E}\{a(\pi)\}/\mathbb{E}\{b(\pi)\}$ over $\pi \in \mathcal{P}$. Then:

\[ 0 = \inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} = \inf_{(a,b) \in Conv(\Omega)} [a - \theta^* b] = \inf_{(a,b) \in \Omega} [a - \theta^* b] = \inf_{\pi \in \mathcal{P}^{pure}} \mathbb{E}\{a(\pi) - \theta^* b(\pi)\} \]

where the third equality holds because the infimum of a
linear function over the convex hull of a set is equal to the
infimum over the set itself. It follows that $\theta^*$ is also the
infimum ratio of $\mathbb{E}\{a(\pi)\}/\mathbb{E}\{b(\pi)\}$ over $\pi \in \mathcal{P}^{pure}$.

This means that to achieve the infimum ratio over policies
$\pi \in \mathcal{P}$, it suffices to restrict our search to pure policies.
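A quick numerical check of this statement, offered as an illustrative sketch only: the four pure policies below are made-up $(\mathbb{E}\{a\}, \mathbb{E}\{b\})$ pairs, and mixtures are drawn at random. No randomly drawn mixture should achieve a ratio below the best pure ratio.

```python
import random

# Hypothetical pure policies, each summarized by (E[a], E[b]) with E[b] > 0.
pure = [(1.0, 2.0), (3.0, 4.0), (-1.0, 5.0), (-2.0, 4.0)]
best_pure = min(a / b for a, b in pure)


def mixture_ratio(probs):
    """Ratio E[a]/E[b] for the mixture that picks pure policy i w.p. probs[i]."""
    a = sum(p * ai for p, (ai, _) in zip(probs, pure))
    b = sum(p * bi for p, (_, bi) in zip(probs, pure))
    return a / b


rng = random.Random(0)
worst_gap = float("inf")  # smallest observed (mixture ratio - best pure ratio)
for _ in range(10000):
    weights = [rng.random() for _ in pure]
    total = sum(weights)
    probs = [w / total for w in weights]
    worst_gap = min(worst_gap, mixture_ratio(probs) - best_pure)
```

The gap stays non-negative because a mixture's ratio is a weighted average of the pure ratios, with weights proportional to $p_i \mathbb{E}\{b_i\}$.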



C. Optimizing with Initial Information 

Suppose at the beginning of each frame, we observe a
vector $\eta[r]$ of initial information that can affect the penalties
and frame size. Suppose that $\{\eta[r]\}_{r=0}^{\infty}$ is i.i.d. over frames.
Each policy $\pi \in \mathcal{P}$ first observes $\eta[r]$ and then chooses a
sub-policy $\pi' \in \mathcal{P}_{\eta[r]}$, where $\mathcal{P}_{\eta[r]}$ is a space that possibly
depends on $\eta[r]$. To minimize $\mathbb{E}\{a(\pi)\}$, it suffices to observe
$\eta[r]$ and choose $\pi' \in \mathcal{P}_{\eta[r]}$ to minimize the conditional
expectation $\mathbb{E}\{a(\pi') | \eta[r]\}$. However, this is not necessarily
true for minimizing the ratio $\mathbb{E}\{a(\pi)\}/\mathbb{E}\{b(\pi)\}$.

A correct approach is the following: If $\theta^*$ is known, we can
simply choose $\pi' \in \mathcal{P}_{\eta[r]}$ to minimize:

\[ \mathbb{E}\{a(\pi') - \theta^* b(\pi') | \eta[r]\} \]

If $\theta^*$ is unknown, we can carry out the bisection routine. Let
$\theta$ be the midpoint in the current iteration. We must compute:

\[ \inf_{\pi \in \mathcal{P}} \mathbb{E}\{a(\pi) - \theta b(\pi)\} = \mathbb{E}\left\{ \inf_{\pi' \in \mathcal{P}_{\eta[r]}} \mathbb{E}\{a(\pi') - \theta b(\pi') | \eta[r]\} \right\} \tag{49} \]

The infimizing decision $\pi'$ can be made by observing $\eta[r]$,
without requiring knowledge of its probability distribution.
However, the value in (49) cannot be computed without
knowledge of this distribution. Instead, suppose we have $W$
i.i.d. samples $\{\eta_w\}_{w=1}^{W}$. We can then approximate the value
in (49) by the function $val(\theta)$ defined below:

\[ val(\theta) \triangleq \frac{1}{W} \sum_{w=1}^{W} \inf_{\pi' \in \mathcal{P}_{\eta_w}} \mathbb{E}\{a(\pi') - \theta b(\pi') | \eta_w\} \tag{50} \]

By the law of large numbers, $val(\theta)$ approaches the exact
value of (49) with a large choice of $W$. The bisection routine
can be carried out using the $val(\theta)$ approximation, being sure
to use the same samples at each step of the iteration (but
different samples on each frame $r$). Note that $val(\theta)$ is
non-increasing in $\theta$, so the bisection will converge provided that
it is initialized so that $val(\theta_{min}) \geq 0$ and $val(\theta_{max}) \leq 0$. If
we cannot independently generate $W$ samples, we use the $W$
past observed values of $\eta[r]$ from previous frames. There is a
subtle issue here, as these past values have influenced system
performance and are thus correlated with the current $a(\pi)$ and
$b(\pi)$ functions. However, a delayed queue argument similar to
that given in [27] shows these past values can still be used.
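The sample-based routine can be sketched as follows. This is an illustration under simplifying assumptions that are not in the paper: each sample $\eta_w$ is represented directly by a finite list of options, each option being a deterministic pair $(a, b)$ with $b > 0$, so the inner conditional expectation is just $a - \theta b$. The names `val`, `bisect_theta`, and `samples` are hypothetical.

```python
import random


def val(theta, samples):
    """Sample average of the per-sample infimum, in the spirit of (50)."""
    return sum(
        min(a - theta * b for a, b in options) for options in samples
    ) / len(samples)


def bisect_theta(samples, lo, hi, tol=1e-6):
    """Bisection on val(theta), reusing the SAME samples at every step."""
    assert val(lo, samples) >= 0 >= val(hi, samples), "bad initial bounds"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if val(mid, samples) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# W = 50 hypothetical samples, each offering 5 options (a, b) with b in [1, 3].
rng = random.Random(1)
samples = [
    [(rng.uniform(-1.0, 1.0), rng.uniform(1.0, 3.0)) for _ in range(5)]
    for _ in range(50)
]
theta_hat = bisect_theta(samples, -5.0, 5.0)
```

Because every option has $b \geq 1$ in this toy data, `val` is strictly decreasing, so the root found by bisection is unique.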

VI. Alternative Algorithms without Ratio 
Minimization 

We first present an alternative formulation to (5)-(7) that is
easier and does not require minimizing a ratio of expectations
every frame. We then present an alternative algorithm to the
original problem (5)-(7) that does not require a ratio
minimization, but which yields a less explicit convergence
result.

A. Alternative Formulation 

Note that constraints of the form $\overline{y}_l \leq 0$ are equivalent
to $\overline{y}_l/\overline{T} \leq c_l$ in the special case $c_l = 0$, and thus can be
handled using the framework of this paper. Thus, if $f[r]$ is
some penalty, and if we desire the constraint $\overline{f} \leq 6.7$, then we
can define $y[r] \triangleq f[r] - 6.7$ and note that the desired constraint
is equivalent to $\overline{y} \leq 0$. In other words, constraints of the form
$\overline{y}_l/\overline{T} \leq c_l$ are more general than constraints of the form $\overline{y}_l \leq 0$
or $\overline{y}_l \leq c$ for some constant $c$, and contain these as special
cases.

Now consider the following problem structure: 

Minimize: $\overline{y}_0$
Subject to: $\overline{y}_l/\overline{T} \leq c_l \quad \forall l \in \{1, \ldots, L\}$
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$

Such a problem has a different structure than the problem (5)-(7),
and is easier to solve as it does not require a ratio of
expectations. It can be solved using the same virtual queues
$Z_l[r]$ in (11), but every frame $r$ observing $Z[r]$ and selecting
a policy $\pi[r] \in \mathcal{P}$ to minimize the following expression:

\[ \mathbb{E}\left\{ V \hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r] [\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r] \right\} \]

Analysis of this algorithm is given in Appendix C.

B. Alternative Algorithm 

The following is an alternative algorithm for the original
problem (5)-(7) that does not require a ratio minimization (and
hence does not require a bisection step): Use the same virtual
queues $Z_l[r]$ in (11). Define $\theta[0] = 0$, and define $\theta[R]$ for
$R \in \{1, 2, 3, \ldots\}$ by:

\[ \theta[R] \triangleq \frac{\sum_{r=0}^{R-1} y_0[r]}{\sum_{r=0}^{R-1} T[r]} \tag{51} \]

Every frame $r$, observe $Z[r]$ and $\theta[r]$ and select a policy
$\pi[r] \in \mathcal{P}$ to minimize the following expression:

\[ \mathbb{E}\{V[\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r])] \,|\, Z[r], \theta[r]\} + \mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r], \theta[r] \right\} \tag{52} \]

It is shown in Appendix D that all constraints are met, and
that if $\theta[r]$ converges to a constant with probability 1, then
with probability 1:

\[ \lim_{R \rightarrow \infty} \frac{\sum_{r=0}^{R-1} y_0[r]}{\sum_{r=0}^{R-1} T[r]} \leq ratio^{opt} + O(1/V) \]

The disadvantage is that the convergence time is not as clear
as that given in part (b) of Theorem 1. Further, use of the time
average (51) makes it difficult to adapt to changes in system
parameters, so that it may be better to approximate (51) with
a moving average or an exponentially decaying average.
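The difference between the full time average (51) and an exponentially decaying approximation can be illustrated with a short sketch. The class names, the smoothing weight `alpha`, and the two-regime input sequence are all hypothetical choices made for illustration; the paper does not specify a particular decaying average.

```python
class RunningRatio:
    """theta[R] = (sum of y0) / (sum of T) over all past frames, as in (51)."""

    def __init__(self):
        self.sum_y0 = 0.0
        self.sum_T = 0.0

    def update(self, y0, T):
        self.sum_y0 += y0
        self.sum_T += T

    def theta(self):
        return self.sum_y0 / self.sum_T if self.sum_T > 0 else 0.0  # theta[0] = 0


class DecayingRatio:
    """Ratio of exponentially decaying averages (an assumed variant of (51))."""

    def __init__(self, alpha):
        self.alpha = alpha  # weight on the newest frame, 0 < alpha <= 1
        self.y0 = 0.0
        self.T = 0.0

    def update(self, y0, T):
        self.y0 += self.alpha * (y0 - self.y0)
        self.T += self.alpha * (T - self.T)

    def theta(self):
        return self.y0 / self.T if self.T > 0 else 0.0


# Hypothetical parameter change: y0 shifts from -1 to -3 after 100 frames.
running, decaying = RunningRatio(), DecayingRatio(alpha=0.1)
for y0, T in [(-1.0, 2.0)] * 100 + [(-3.0, 2.0)] * 100:
    running.update(y0, T)
    decaying.update(y0, T)
```

After the shift, the new per-frame ratio is $-1.5$. The decaying estimate tracks it closely, while the full time average still sits at $-1.0$ because it weights the old regime equally.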

VII. Simulations for a Task Processing Network 





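[Figure: timeline of frame $r$ divided into a Control Phase of duration 0.5, a Transmission phase of duration $T^{tran}_{l[r]}[r]$, and an Idle phase of duration $Idle[r]$.]

Fig. 2. An illustration of the 3 phases of a renewal frame $r \in \{0, 1, 2, \ldots\}$.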

Here we provide a simple task processing example. An
infinite sequence of tasks must be processed one at a time
with the help of a network of 5 wireless devices. This applies,
for example, in scenarios similar to [23] where each new task
represents an event that is sensed by the wireless devices (each
at different sensing qualities [28]), and we must select which
device reports the event information. The renewal structure
is shown in Fig. 2. At the beginning of each new task $r$, a
period of 0.5 time units is expended to communicate control
information about the task. Each of the 5 devices expends 0.5
units of energy in this control phase. At the end of this phase,
the network controller obtains a vector $\eta[r]$ of parameters for
task $r$. The vector $\eta[r]$ has the form:

\[ \eta[r] = [(qual_1[r], T_1^{tran}[r]), \ldots, (qual_5[r], T_5^{tran}[r])] \]

where for each $l \in \{1, \ldots, 5\}$, $qual_l[r]$ is a real number
representing the information quality if device $l$ is chosen to
process task $r$, and $T_l^{tran}[r]$ is the transmission time required
for device $l$ to transmit the corresponding information to a
receiving station. The controller must choose one of the 5
devices to process the task, and must also choose the amount
of idle time at the end of the frame (chosen within the interval
$[0, I^{max}]$ for some constant $I^{max} > 0$), so that the policy
decision $\pi[r]$ has the form:

\[ \pi[r] = (l[r], Idle[r]) \in \{1, 2, 3, 4, 5\} \times \{I \in \mathbb{R} \,|\, 0 \leq I \leq I^{max}\} \]

[Figure: plot titled "Utility versus W", with x-axis "Sample Size W" and a curve labeled "Modified Algorithm with Time Averaging".]

Fig. 3. Utility for the drift-plus-penalty ratio algorithm (with bisection) and
the time-averaged alternative.

Define $P^{tran}$ as the power expenditure associated with
wireless transmission. The chosen device $l[r]$ expends
$P^{tran} \times T^{tran}_{l[r]}[r]$ units of energy in the transmit phase, while all other
devices $l \neq l[r]$ expend no energy in this phase. None of the
devices expend energy in the idle phase, which helps to limit
the average power expenditure in the system.

The goal is to maximize the quality of information (q.o.i.)
per unit time subject to an average power constraint of 0.25 at
each device. Define $\hat{y}_0(\pi[r])$ as $-1$ times the q.o.i. obtained
for task $r$, $\hat{y}_l(\pi[r])$ as the energy expended by device $l$ on task
$r$, and $\hat{T}(\pi[r])$ as the frame duration for task $r$:

\[ \hat{y}_0(\pi[r]) \triangleq -qual_{l[r]}[r] \]
\[ \hat{y}_l(\pi[r]) \triangleq 0.5 + P^{tran} T_l^{tran}[r] 1_{\{l[r]=l\}} \quad \forall l \in \{1, \ldots, 5\} \]
\[ \hat{T}(\pi[r]) \triangleq 0.5 + T^{tran}_{l[r]}[r] + Idle[r] \]

where $1_{\{l[r]=l\}}$ is an indicator function that is 1 if $l[r] = l$
and 0 else. The problem is then to minimize $\overline{y}_0/\overline{T}$ subject to
$\overline{y}_l/\overline{T} \leq 0.25$ for all $l \in \{1, \ldots, 5\}$.

We simulate the drift-plus-penalty ratio algorithm for $10^6$
frames, using the bisection method with $W$ past samples
of $\eta[r]$ as in (50) of Section V-C. We use $P^{tran} = 1.0$ and
$I^{max}$ = [value illegible]. The vectors $\{\eta[r]\}_{r=0}^{\infty}$ are assumed to be i.i.d.
with independently chosen components, where $T_l^{tran}[r]$ is
uniformly distributed in $[0.5, 2.5]$ for all $l$, and $qual_l[r]$ is
uniformly distributed in $[0, l]$ for $l \in \{1, 2, 3, 4, 5\}$ (so that device
5 tends to have the highest quality, while device 1 tends to have
the lowest). We initialize $\theta_{min} = -5V$, $\theta_{max} = 3 \sum_{l=1}^{5} Z_l[r]$.
Each step of the bisection computes $val(\theta)$ in (50) according
to a simple deterministic optimization. In particular, for the
$w$th term in $val(\theta)$, corresponding to sample $\eta_w$, we choose
$Idle[r, w] = 0$ whenever $\theta \leq 0$, and $Idle[r, w] = I^{max}$ if
$\theta > 0$, and choose $l[r, w]$ as the index $l \in \{1, \ldots, 5\}$ that
minimizes:

\[ -V qual_l[r, w] + (Z_l[r] P^{tran} - \theta) T_l^{tran}[r, w] \]

The bisection routine is run for each frame until
$\theta_{max} - \theta_{min} < 0.001$. Using $V = 100$, the resulting q.o.i. per unit
time is plotted in Fig. 3. This increases to its optimal value as
$W$ is increased. However, in this example, $W$ does not need
to be very large for accurate results: Even $W = 1$ produces
a value that is near optimal (note that the y-axis in Fig. 3
distinguishes utility only in the 3rd significant digit).
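The per-sample decision rule above can be checked against a brute-force minimization of the full per-sample expression $-V qual_l + \sum_k Z_k \hat{y}_k - \theta \hat{T}$. The sketch below is illustrative only: the function names, the random test instances, and the value `I_max = 11.0` are assumptions (11.0 is borrowed from the re-run mentioned later in this section), and ties at $\theta = 0$ are broken toward zero idle time.

```python
import random


def frame_cost(l, idle, theta, V, Z, P_tran, qual, T_tran):
    """Full per-sample expression for device l (0-based) and a given idle time."""
    weighted_energy = sum(
        Z[k] * (0.5 + (P_tran * T_tran[k] if k == l else 0.0))
        for k in range(len(Z))
    )
    return -V * qual[l] + weighted_energy - theta * (0.5 + T_tran[l] + idle)


def closed_form_choice(theta, V, Z, P_tran, qual, T_tran, I_max):
    """Rule from the text: threshold on theta for Idle, argmin for the device."""
    idle = 0.0 if theta <= 0 else I_max
    l = min(
        range(len(Z)),
        key=lambda k: -V * qual[k] + (Z[k] * P_tran - theta) * T_tran[k],
    )
    return l, idle


rng = random.Random(0)
V, P_tran, I_max = 100.0, 1.0, 11.0  # I_max is an assumed value
max_gap = 0.0
for _ in range(200):
    Z = [rng.uniform(0.0, 50.0) for _ in range(5)]
    qual = [rng.uniform(0.0, k + 1) for k in range(5)]   # device k+1: [0, k+1]
    T_tran = [rng.uniform(0.5, 2.5) for _ in range(5)]
    theta = rng.uniform(-20.0, 20.0)
    l, idle = closed_form_choice(theta, V, Z, P_tran, qual, T_tran, I_max)
    # The expression is linear in idle, so checking the endpoints suffices.
    brute = min(
        frame_cost(k, i, theta, V, Z, P_tran, qual, T_tran)
        for k in range(5) for i in (0.0, I_max)
    )
    gap = frame_cost(l, idle, theta, V, Z, P_tran, qual, T_tran) - brute
    max_gap = max(max_gap, gap)
```

The rule works because the expression separates: the idle time enters only through $-\theta \cdot Idle$, and the terms that depend on the device choice reduce to $-V qual_l + (Z_l P^{tran} - \theta) T_l^{tran}$.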

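All average power constraints are met in all simulations
(for each $W$). Results for $W = 10$ are: q.o.i./$\overline{T}$ = 0.852950,
$\overline{T}$ = 3.180275, $\overline{Idle}$ = 1.421260, $\overline{y}_0$ = -2.712615, and,
against the constraint value 0.25:

$\overline{y}_1/\overline{T}$ = 0.182335
$\overline{y}_2/\overline{T}$ = 0.249547 , $\overline{y}_3/\overline{T}$ = 0.250018
$\overline{y}_4/\overline{T}$ = 0.250032 , $\overline{y}_5/\overline{T}$ = 0.250046

The largest of these values exceeds 0.25 by less than $5 \times 10^{-5}$.
It can be seen that devices $\{2, \ldots, 5\}$ are utilized to their
maximum power constraints because these tend to give the
highest quality, while average power for device 1 is slack.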

The alternative algorithm of Section VI-B, which does not
require a bisection routine and amounts to a simple
deterministic optimization for (52) every frame, achieves similar
time average power expenditures to the above. It also achieves
utility as shown in Fig. 3, being the constant that does not
depend on $W$ (as no sampling from the past is needed). Its
utility is slightly larger than that of the bisection algorithm,
and is approached by the bisection algorithm as $W$ increases.
It appears that this algorithm is simpler and yields "automatic
learning" by using the time average value $\theta[r]$, but it might
have trouble adapting if system parameters change.

Details on the particular decisions made in the simulation
on each frame are provided in Appendix G. There, it is also
shown that the particular structure of this example admits
deterministic bounds on the constraint violations. In particular,
if $I^{max}$ is chosen to be suitably large (11.0 in this case), then
we can guarantee that for all integers $R > 0$:

\[ \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l + \frac{d_1 + d_2 V}{R} \]

for some positive constants $d_1$, $d_2$. If we re-run the simulations
using $I^{max} = 11.0$ and $W = 10$, we get similar average values
as given above, and in particular the algorithm results in the
same $\overline{Idle} \approx 1.42$, as expected.

VIII. Conclusion 

We have developed a method for optimizing time averages 
in general renewal systems. Every renewal frame, a policy is 
chosen that affects the frame size and also affects a penalty 
vector and/or an attribute vector. A dynamic algorithm was 
developed to minimize the time average of one penalty subject 
to time average constraints on the others. A related algorithm 
was developed to maximize a concave function of the time 
average attribute vector, subject to time average constraints. 
This work extends the theory of Lyapunov optimization to 
treat much more general classes of systems, including task 
processing networks with variable length scheduling modes.

Appendix A — Proof of Lemma 1

We present a definition and a simple lemma before the
proof of Lemma 1. Define $\Gamma$ as the set of all vectors
$[y_0, y_1, \ldots, y_L, T] \in \mathbb{R}^{L+2}$ that are achievable as time
averages under i.i.d. algorithms. Thus, a vector $[y_0, y_1, \ldots, y_L, T]$
is in the set $\Gamma$ if and only if there is an i.i.d. algorithm $\pi^*[r]$
such that:

\[ y_l = \mathbb{E}\{\hat{y}_l(\pi^*[r])\} \quad \forall l \in \{0, 1, \ldots, L\} \]
\[ T = \mathbb{E}\{\hat{T}(\pi^*[r])\} \]

It is easy to show that the set $\Gamma$ is bounded and convex. Further,
for any frame $r$ and under any (possibly non-i.i.d.) algorithm
$\pi[r]$, we have:

\[ \mathbb{E}\{[\hat{y}_0(\pi[r]), \ldots, \hat{y}_L(\pi[r]), \hat{T}(\pi[r])]\} \in \Gamma \]

This is because the policy chooses $\pi[r]$ according to some
conditional distribution given the past history, but this
distribution can be viewed on frame $r$ as one that is from an i.i.d.
algorithm.

Lemma 6: For any algorithm that chooses $\pi[r] \in \mathcal{P}$ for all
frames $r$, we have for all integers $R > 0$:

\[ \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{[\hat{y}_0(\pi[r]), \ldots, \hat{y}_L(\pi[r]), \hat{T}(\pi[r])]\} \in \Gamma \]

Proof: (Lemma 6) Each of the individual terms in the sum
is in $\Gamma$, and so the average of these terms is in $\Gamma$ (because $\Gamma$
is convex). □

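We now prove Lemma 1. Suppose the constraints of problem
(12)-(14) are feasible, and define $ratio^{opt}$ as the infimum
of the objective function over all feasible algorithms. Fix any
$\delta > 0$. Then there must be an algorithm $\pi[r]$ that satisfies the
constraints (12)-(14) and yields a limsup ratio within $\delta/2$ of
the infimum. That is, $\pi[r] \in \mathcal{P}$ for all frames $r$, and:

\[ \limsup_{R \rightarrow \infty} \left[ \frac{\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\hat{y}_0(\pi[r])\}}{\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\hat{T}(\pi[r])\}} \right] \leq ratio^{opt} + \delta/2 \]
\[ \limsup_{R \rightarrow \infty} \left[ \frac{\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\hat{y}_l(\pi[r])\}}{\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\hat{T}(\pi[r])\}} \right] \leq c_l \quad \forall l \in \{1, \ldots, L\} \]

It follows that there is a finite integer $R^*$ such that:

\[ \frac{\frac{1}{R^*} \sum_{r=0}^{R^*-1} \mathbb{E}\{\hat{y}_0(\pi[r])\}}{\frac{1}{R^*} \sum_{r=0}^{R^*-1} \mathbb{E}\{\hat{T}(\pi[r])\}} \leq ratio^{opt} + \delta \]
\[ \frac{\frac{1}{R^*} \sum_{r=0}^{R^*-1} \mathbb{E}\{\hat{y}_l(\pi[r])\}}{\frac{1}{R^*} \sum_{r=0}^{R^*-1} \mathbb{E}\{\hat{T}(\pi[r])\}} \leq c_l + \delta \quad \forall l \in \{1, \ldots, L\} \]

By Lemma 6 we know there is an i.i.d. algorithm $\pi^*[r]$ such
that:

\[ \frac{1}{R^*} \sum_{r=0}^{R^*-1} \mathbb{E}\{[\hat{y}_0(\pi[r]), \ldots, \hat{y}_L(\pi[r]), \hat{T}(\pi[r])]\} = \mathbb{E}\{[\hat{y}_0(\pi^*[r]), \ldots, \hat{y}_L(\pi^*[r]), \hat{T}(\pi^*[r])]\} \]

Plugging this identity into the above inequalities yields:

\[ \frac{\mathbb{E}\{\hat{y}_0(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} \leq ratio^{opt} + \delta \]
\[ \frac{\mathbb{E}\{\hat{y}_l(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} \leq c_l + \delta \quad \forall l \in \{1, \ldots, L\} \]

Multiplying the above by the positive number $\mathbb{E}\{\hat{T}(\pi^*[r])\}$
proves Lemma 1.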

Appendix B — Bound on $\overline{y}_l[R]/\overline{T}[R]$ in Theorem 1

This section provides an upper bound on $\overline{y}_l[R]/\overline{T}[R]$, which
shows how long is required to come close to meeting the
constraints $\overline{y}_l/\overline{T} \leq c_l$. Suppose the assumptions of Theorem
1 hold. From the earlier drift analysis we have for all frames $r$:

\[ \Delta(Z[r]) \leq B + T^{max}[C + V ratio^{opt}] - V y_0^{min} \]

where we have used the fact that the conditional expectation
of $\hat{y}_0(\pi[r])$ is bounded below by $y_0^{min}$, and the conditional
expectation of $\hat{T}(\pi[r])$ is bounded above by $T^{max}$. It follows
that:

\[ \Delta(Z[r]) \leq (F_1 + V F_2)/2 \tag{53} \]

where the constants $F_1$ and $F_2$ are defined:

\[ F_1 \triangleq 2(B + T^{max} C) \, , \quad F_2 \triangleq 2(T^{max} ratio^{opt} - y_0^{min}) \]

Substituting the definition of $\Delta(Z[r])$ in (53) yields:

\[ \mathbb{E}\{L(Z[r+1]) - L(Z[r]) \,|\, Z[r]\} \leq (F_1 + V F_2)/2 \]

Taking expectations of the above and using the law of iterated
expectations yields:

\[ \mathbb{E}\{L(Z[r+1])\} - \mathbb{E}\{L(Z[r])\} \leq (F_1 + V F_2)/2 \]

The above holds for all frames $r$. Summing over
$r \in \{0, 1, \ldots, R-1\}$ (for some positive integer $R$) and dividing
by $R$ gives:

\[ \frac{\mathbb{E}\{L(Z[R])\} - \mathbb{E}\{L(Z[0])\}}{R} \leq (F_1 + V F_2)/2 \]

Using the fact that $L(Z[r]) \triangleq \frac{1}{2} \sum_{l=1}^{L} Z_l[r]^2$, we have:

\[ \sum_{l=1}^{L} \frac{\mathbb{E}\{Z_l[R]^2\}}{R} \leq (F_1 + V F_2) + \sum_{l=1}^{L} \frac{\mathbb{E}\{Z_l[0]^2\}}{R} \]

Thus, for every $l \in \{1, \ldots, L\}$ we have:

\[ \frac{\mathbb{E}\{Z_l[R]^2\}}{R^2} \leq \frac{F_1 + V F_2}{R} + \frac{\sum_{i=1}^{L} \mathbb{E}\{Z_i[0]^2\}}{R^2} \]

By Jensen's inequality, $\mathbb{E}\{Z_l[R]\}^2 \leq \mathbb{E}\{Z_l[R]^2\}$, and so for
all integers $R > 0$ and all $l \in \{1, \ldots, L\}$ we have:

\[ \frac{\mathbb{E}\{Z_l[R]\}}{R} \leq \sqrt{\frac{F_1 + V F_2}{R} + \frac{\sum_{i=1}^{L} \mathbb{E}\{Z_i[0]^2\}}{R^2}} \]

Therefore, from (34) we have for all $l \in \{1, \ldots, L\}$:

\[ \frac{\overline{y}_l[R]}{\overline{T}[R]} \leq c_l + \frac{1}{\overline{T}[R]} \frac{\mathbb{E}\{Z_l[R]\}}{R} \leq c_l + \frac{1}{T^{min}} \sqrt{\frac{F_1 + V F_2}{R} + \frac{\sum_{i=1}^{L} \mathbb{E}\{Z_i[0]^2\}}{R^2}} \]

Appendix C — Analysis of Alternative Formulation

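Consider the alternative problem from Section VI-A:

Minimize: $\limsup_{R \rightarrow \infty} \overline{y}_0[R]$ (54)
Subject to: $\limsup_{R \rightarrow \infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l \quad \forall l \in \{1, \ldots, L\}$ (55)
$\pi[r] \in \mathcal{P} \quad \forall r \in \{0, 1, 2, \ldots\}$ (56)

Assume the same boundedness assumptions of Section II-B
hold. Further assume that a $C$-additive approximation of the
algorithm of Section VI-A is implemented, so that every frame
$r$ we observe $Z[r]$ and choose $\pi[r]$ to yield:

\[ \mathbb{E}\left\{ V \hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r] \right\} \leq C + \mathbb{E}\left\{ V \hat{y}_0(\pi^*[r]) + \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi^*[r]) - c_l \hat{T}(\pi^*[r])] \,\Big|\, Z[r] \right\} \tag{57} \]

where $C$ is a given non-negative constant, and $\pi^*[r]$ is any
i.i.d. algorithm.

Theorem 3: Under the above assumptions, and assuming
the constraints of the problem (54)-(56) are feasible, then:

a) For all $l \in \{1, \ldots, L\}$ we have:

\[ \limsup_{R \rightarrow \infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l \tag{58} \]
\[ \limsup_{R \rightarrow \infty} \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l \quad (w.p.1) \tag{59} \]

b) For all integers $R > 0$ we have:

\[ \overline{y}_0[R] \leq y_0^{opt} + \frac{B + C}{V} + \frac{\mathbb{E}\{L(Z[0])\}}{V R} \]

where $B$ is defined in (19), and $y_0^{opt}$ is the infimum value of
(54) subject to (55)-(56).

Proof: Consider any frame $r \in \{0, 1, 2, \ldots\}$. From (18) we
have:

\[ \Delta(Z[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) \,|\, Z[r]\} \leq B + \mathbb{E}\left\{ V \hat{y}_0(\pi[r]) + \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r] \right\} \]

Substituting (57) yields:

\[ \Delta(Z[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) \,|\, Z[r]\} \leq B + C + \mathbb{E}\left\{ V \hat{y}_0(\pi^*[r]) + \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi^*[r]) - c_l \hat{T}(\pi^*[r])] \,\Big|\, Z[r] \right\} \tag{60} \]

where $\pi^*[r]$ is any i.i.d. algorithm. As in Lemma 1, it can be
shown that if the problem (54)-(56) is feasible, then for any
$\delta > 0$ there is an i.i.d. algorithm $\pi^*[r]$ such that:

\[ \mathbb{E}\{\hat{y}_0(\pi^*[r])\} \leq y_0^{opt} + \delta \]
\[ \mathbb{E}\{\hat{y}_l(\pi^*[r])\}/\mathbb{E}\{\hat{T}(\pi^*[r])\} \leq c_l + \delta \quad \forall l \in \{1, \ldots, L\} \]

Plugging the above into the right-hand-side of (60) yields:

\[ \Delta(Z[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) \,|\, Z[r]\} \leq B + C + V(y_0^{opt} + \delta) + \sum_{l=1}^{L} Z_l[r] \left[ (c_l + \delta) \mathbb{E}\{\hat{T}(\pi^*[r])\} - \mathbb{E}\{c_l \hat{T}(\pi^*[r])\} \right] \]

Taking $\delta \rightarrow 0$ yields:

\[ \Delta(Z[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) \,|\, Z[r]\} \leq B + C + V y_0^{opt} \tag{61} \]

From this we obtain:

\[ \Delta(Z[r]) \leq B + C + V(y_0^{opt} - y_0^{min}) \]

Thus, the quadratic Lyapunov drift is less than or equal to a
constant, from which we obtain the result of part (a) by the
same argument as in the proof of Theorem 1.

To prove part (b), taking expectations of (61) yields:

\[ \mathbb{E}\{L(Z[r+1])\} - \mathbb{E}\{L(Z[r])\} + V \mathbb{E}\{y_0[r]\} \leq B + C + V y_0^{opt} \]

Fix an integer $R > 0$. Summing the above over
$r \in \{0, \ldots, R-1\}$ and dividing by $R$ gives:

\[ \frac{\mathbb{E}\{L(Z[R])\} - \mathbb{E}\{L(Z[0])\}}{R} + V \overline{y}_0[R] \leq B + C + V y_0^{opt} \]

Rearranging terms and noting that $\mathbb{E}\{L(Z[R])\} \geq 0$ proves
the result. □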

Appendix D — Analysis of the Alternative 
Algorithm with Time Averaging 

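Here we consider the original problem of minimizing $\overline{y}_0/\overline{T}$
subject to $\overline{y}_l/\overline{T} \leq c_l$ for $l \in \{1, \ldots, L\}$, and analyze
the alternative algorithm (with time averaging) described in
Section VI-B. Recall that queues $Z[r]$ still operate according
to (11). Define $\theta[0] \triangleq 0$, and for integers $r > 0$ define $y_0^{av}[r]$,
$T^{av}[r]$, $\theta[r]$ by:

\[ y_0^{av}[r] \triangleq \frac{1}{r} \sum_{i=0}^{r-1} y_0[i] \, , \quad T^{av}[r] \triangleq \frac{1}{r} \sum_{i=0}^{r-1} T[i] \, , \quad \theta[r] \triangleq \frac{y_0^{av}[r]}{T^{av}[r]} \]

Assume that we use a $C$-additive approximation for the policy
selection step in (52), so that every frame $r$ we observe $Z[r]$
and $\theta[r]$ and choose $\pi[r] \in \mathcal{P}$ to yield:

\[ \mathbb{E}\{V[\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r])] \,|\, Z[r], \theta[r]\} + \mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r], \theta[r] \right\} \leq C + V \mathbb{E}\{\hat{y}_0(\pi^*[r])\} - V \theta[r] \mathbb{E}\{\hat{T}(\pi^*[r])\} + \sum_{l=1}^{L} Z_l[r] \left[ \mathbb{E}\{\hat{y}_l(\pi^*[r])\} - c_l \mathbb{E}\{\hat{T}(\pi^*[r])\} \right] \tag{62} \]

where $\pi^*[r]$ is any i.i.d. algorithm.

We now make the following convergence assumption:

Assumption 1: There are constants $y_0^{av}$, $T^{av}$, and
$\theta^* \triangleq y_0^{av}/T^{av}$, such that under the implementation we have the
following convergence properties:

\[ \lim_{r \rightarrow \infty} (y_0^{av}[r], T^{av}[r], \theta[r]) = (y_0^{av}, T^{av}, \theta^*) \quad (w.p.1) \tag{63} \]
\[ \lim_{r \rightarrow \infty} (\mathbb{E}\{y_0^{av}[r]\}, \mathbb{E}\{T^{av}[r]\}, \mathbb{E}\{\theta[r]\}) = (y_0^{av}, T^{av}, \theta^*) \tag{64} \]

The equality (64) typically holds whenever (63) holds.
Indeed, taking an expectation of (63) and assuming we can
pass the expectation through the limit yields (64). We can
exchange the limit and the expectation whenever $y_0^{av}[r]$,
$T^{av}[r]$, $\theta[r]$ are deterministically bounded for all $r$ and all
sample paths, or when milder conditions hold that allow the
Lebesgue dominated convergence theorem to be applied [24].
If (63) holds, it is easy to show that, with probability 1:

\[ \lim_{R \rightarrow \infty} \frac{1}{R} \sum_{r=0}^{R-1} \theta[r] T[r] = \theta^* T^{av} = y_0^{av} \]

It is natural (and useful) to assume the above also holds in
expectation:

Assumption 2: With $\theta^*$, $T^{av}$, $y_0^{av}$ as defined in Assumption
1, we have:

\[ \lim_{R \rightarrow \infty} \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r] T[r]\} = \theta^* T^{av} = y_0^{av} \]

Theorem 4: Assume the constraints of the problem (12)-(14)
are feasible, that $\mathbb{E}\{L(Z[0])\} < \infty$, and that we use
a $C$-additive approximation as described above every frame.
Then:

a) For all $l \in \{1, \ldots, L\}$ we have:

\[ \limsup_{R \rightarrow \infty} \overline{y}_l[R]/\overline{T}[R] \leq c_l \]
\[ \limsup_{R \rightarrow \infty} \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l \quad (w.p.1) \]

b) If Assumptions 1 and 2 hold, then the achieved value of
$\overline{y}_0/\overline{T}$ satisfies the following with probability 1:

\[ \lim_{R \rightarrow \infty} y_0^{av}[R]/T^{av}[R] \leq ratio^{opt} + (B + C)/(V T^{min}) \]

where $B$ is defined as in (19) (with a conditional expectation
that is also given $\theta[r]$).

Proof: Define $\Delta(Z[r], \theta[r])$ as the conditional drift,
conditioned also on knowledge of $\theta[r]$. This has the same form as
(18), and so from (18) we have:

\[ \Delta(Z[r], \theta[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r]) \,|\, Z[r], \theta[r]\} \leq B + V \mathbb{E}\{\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r]) \,|\, Z[r], \theta[r]\} + \mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r][\hat{y}_l(\pi[r]) - c_l \hat{T}(\pi[r])] \,\Big|\, Z[r], \theta[r] \right\} \]

Using (62) in the right-hand-side above yields:

\[ \Delta(Z[r], \theta[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r]) \,|\, Z[r], \theta[r]\} \leq B + C + V \mathbb{E}\{\hat{y}_0(\pi^*[r])\} - V \theta[r] \mathbb{E}\{\hat{T}(\pi^*[r])\} + \sum_{l=1}^{L} Z_l[r] \left[ \mathbb{E}\{\hat{y}_l(\pi^*[r])\} - c_l \mathbb{E}\{\hat{T}(\pi^*[r])\} \right] \]

where $\pi^*[r]$ is any i.i.d. algorithm. Now plug the i.i.d.
algorithm $\pi^*[r]$ from Lemma 1 into the right-hand-side of the
above and take $\delta \rightarrow 0$ to yield:

\[ \Delta(Z[r], \theta[r]) + V \mathbb{E}\{\hat{y}_0(\pi[r]) - \theta[r] \hat{T}(\pi[r]) \,|\, Z[r], \theta[r]\} \leq B + C + V ratio^{opt} \mathbb{E}\{\hat{T}(\pi^*[r])\} - V \theta[r] \mathbb{E}\{\hat{T}(\pi^*[r])\} \tag{65} \]

The above has the form:

\[ \Delta(Z[r], \theta[r]) \leq F \]

for some finite constant $F$, and so part (a) holds by arguments
similar to those in the proof of Theorem 1.

Now taking expectations of (65) yields:

\[ \mathbb{E}\{L(Z[r+1])\} - \mathbb{E}\{L(Z[r])\} + V[\mathbb{E}\{y_0[r]\} - \mathbb{E}\{\theta[r] T[r]\}] \leq B + C + V ratio^{opt} \mathbb{E}\{\hat{T}(\pi^*[r])\} - V \mathbb{E}\{\theta[r]\} \mathbb{E}\{\hat{T}(\pi^*[r])\} \]

Summing the above over $r \in \{0, \ldots, R-1\}$ and dividing by
$R$ yields:

\[ \frac{\mathbb{E}\{L(Z[R])\} - \mathbb{E}\{L(Z[0])\}}{R} + V \left[ \overline{y}_0[R] - \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r] T[r]\} \right] \leq B + C + V ratio^{opt} \mathbb{E}\{\hat{T}(\pi^*[r])\} - V \mathbb{E}\{\hat{T}(\pi^*[r])\} \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r]\} \]

Thus:

\[ V \left[ \overline{y}_0[R] - \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r] T[r]\} \right] \leq B + C + V ratio^{opt} \mathbb{E}\{\hat{T}(\pi^*[r])\} - V \mathbb{E}\{\hat{T}(\pi^*[r])\} \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r]\} + \frac{\mathbb{E}\{L(Z[0])\}}{R} \]

However, by Assumptions 1 and 2:

\[ \lim_{R \rightarrow \infty} \left[ \overline{y}_0[R] - \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{\theta[r] T[r]\} \right] = y_0^{av} - y_0^{av} = 0 \]

Thus, taking limits of the above yields:

\[ 0 \leq B + C + V ratio^{opt} T^* - V T^* \theta^* \]

where we have defined $T^* \triangleq \mathbb{E}\{\hat{T}(\pi^*[r])\}$, and we have used
the fact that if $\mathbb{E}\{\theta[r]\} \rightarrow \theta^*$ then so does its time average.
Rearranging terms yields:

\[ y_0^{av}/T^{av} \triangleq \theta^* \leq ratio^{opt} + \frac{B + C}{V T^*} \leq ratio^{opt} + \frac{B + C}{V T^{min}} \]

This proves Theorem 4. □

Appendix E — Variation on Jensen's Inequality

Here we prove part (a) of Lemma 3. Recall that $(T, \gamma)$
is a random vector with arbitrary joint distribution such that
$T > 0$ and $\gamma \in \mathcal{R}$ with probability 1, and $0 < \mathbb{E}\{T\} < \infty$.
Define $\{(T[r], \gamma[r])\}_{r=0}^{\infty}$ as an infinite sequence of i.i.d.
random vectors, each distributed the same as $(T, \gamma)$. Define
$t[0] = 0$, and for integers $r > 0$ define $t[r] \triangleq \sum_{i=0}^{r-1} T[i]$. The
value of $t[r]$ can be viewed as the $r$th renewal time in a process
with i.i.d. renewal durations of size $T[r]$. Define $\gamma(t)$ as a
random vector process defined over continuous time $t \geq 0$,
taking the value $\gamma[r]$ whenever $t$ is in the $r$th renewal interval,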
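so that:

\[ \gamma(t) = \gamma[r] \mbox{ if and only if } t[r] \leq t < t[r+1] \]

Therefore, for any integer $r > 0$:

\[ \frac{1}{t[r]} \int_0^{t[r]} \phi(\gamma(t)) dt = \frac{\sum_{i=0}^{r-1} T[i] \phi(\gamma[i])}{\sum_{i=0}^{r-1} T[i]} \]

On the other hand, by Jensen's inequality for integrals of
concave functions, we have:

\[ \frac{1}{t[r]} \int_0^{t[r]} \phi(\gamma(t)) dt \leq \phi\left( \frac{1}{t[r]} \int_0^{t[r]} \gamma(t) dt \right) \]

Thus, for all integers $r > 0$ we have:

\[ \frac{\sum_{i=0}^{r-1} T[i] \phi(\gamma[i])}{\sum_{i=0}^{r-1} T[i]} \leq \phi\left( \frac{\sum_{i=0}^{r-1} T[i] \gamma[i]}{\sum_{i=0}^{r-1} T[i]} \right) \tag{66} \]

However, by the law of large numbers, we know:

\[ \lim_{r \rightarrow \infty} \frac{1}{r} \sum_{i=0}^{r-1} T[i] = \mathbb{E}\{T\} \quad (w.p.1) \]
\[ \lim_{r \rightarrow \infty} \frac{1}{r} \sum_{i=0}^{r-1} T[i] \phi(\gamma[i]) = \mathbb{E}\{T \phi(\gamma)\} \quad (w.p.1) \]
\[ \lim_{r \rightarrow \infty} \frac{1}{r} \sum_{i=0}^{r-1} \gamma[i] T[i] = \mathbb{E}\{T \gamma\} \quad (w.p.1) \]

Taking limits as $r \rightarrow \infty$ in (66) and using the
above identities together with continuity of $\phi(\gamma)$ proves that:

\[ \frac{\mathbb{E}\{T \phi(\gamma)\}}{\mathbb{E}\{T\}} \leq \phi\left( \frac{\mathbb{E}\{T \gamma\}}{\mathbb{E}\{T\}} \right) \]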

Appendix F — Analysis of Utility Optimization 

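Here we prove Theorem 2. For simplicity, assume that
the set $\Gamma$ of all expectations $[\hat{y}(\pi[r]), \hat{x}(\pi[r]), \hat{T}(\pi[r])]$ under i.i.d.
algorithms is closed. Then, similar to Lemma 1, it can be
shown that if the problem (42)-(44) is feasible, then there is
an i.i.d. algorithm $\pi^*[r]$ and a vector $\gamma^* \in \mathcal{R}$ such that:

\[ \phi(\gamma^*) = util^{opt} \tag{67} \]
\[ \frac{\mathbb{E}\{\hat{x}_m(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} = \gamma_m^* \quad \forall m \in \{1, \ldots, M\} \tag{68} \]
\[ \frac{\mathbb{E}\{\hat{y}_l(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} \leq c_l \quad \forall l \in \{1, \ldots, L\} \tag{69} \]

If the set $\Gamma$ is not closed, then (67)-(69) can be modified to
show they hold to within any $\delta > 0$, and the same result we
derive below can be recovered by taking $\delta \rightarrow 0$, as in the
proof of Theorem 1.

Now define $Q[r] \triangleq [Z[r]; G[r]]$ as the collection of all
queues, and define the Lyapunov function:

\[ L(Q[r]) \triangleq \frac{1}{2} \sum_{l=1}^{L} Z_l[r]^2 + \frac{1}{2} \sum_{m=1}^{M} G_m[r]^2 \]

For simplicity, assume initial conditions satisfy
$Z_l[0] = G_m[0] = 0$ for all $l$ and $m$. Define
$\Delta(Q[r]) \triangleq \mathbb{E}\{L(Q[r+1]) - L(Q[r]) \,|\, Q[r]\}$. By an argument
similar to that given in Lemma 2, we can square the queue
update equations (11) and (41) to obtain the following
drift-plus-penalty bound:

\[ \Delta(Q[r]) - V \mathbb{E}\{T[r] \phi(\gamma[r]) \,|\, Q[r]\} \leq D - V \mathbb{E}\{T[r] \phi(\gamma[r]) \,|\, Q[r]\} + \mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r][y_l[r] - c_l T[r]] \,\Big|\, Q[r] \right\} + \mathbb{E}\left\{ \sum_{m=1}^{M} G_m[r][T[r] \gamma_m[r] - x_m[r]] \,\Big|\, Q[r] \right\} \]

where $D$ is the constant defined earlier. This drift-plus-penalty bound can
be rearranged as follows:

\[ \Delta(Q[r]) - V \mathbb{E}\{T[r] \phi(\gamma[r]) \,|\, Q[r]\} \leq D - \mathbb{E}\left\{ T[r] \left[ V \phi(\gamma[r]) - \sum_{m=1}^{M} G_m[r] \gamma_m[r] \right] \Big|\, Q[r] \right\} + \mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r] y_l[r] - \sum_{m=1}^{M} G_m[r] x_m[r] \,\Big|\, Q[r] \right\} - \sum_{l=1}^{L} Z_l[r] c_l \mathbb{E}\{T[r] \,|\, Q[r]\} \tag{70} \]

However, by the auxiliary variable update algorithm, we know
that for every frame $r$:

\[ V \phi(\gamma[r]) - \sum_{m=1}^{M} G_m[r] \gamma_m[r] \geq V \phi(\gamma^*) - \sum_{m=1}^{M} G_m[r] \gamma_m^* \]

for any vector $\gamma^* = (\gamma_1^*, \ldots, \gamma_M^*) \in \mathcal{R}$. Further, because
we use a $C$-additive approximation when choosing $\pi[r] \in \mathcal{P}$,
every frame $r$ we have:

\[ \frac{\mathbb{E}\left\{ \sum_{l=1}^{L} Z_l[r] \hat{y}_l(\pi[r]) - \sum_{m=1}^{M} G_m[r] \hat{x}_m(\pi[r]) \,|\, Q[r] \right\}}{\mathbb{E}\{\hat{T}(\pi[r]) \,|\, Q[r]\}} \leq C + \frac{\sum_{l=1}^{L} Z_l[r] \mathbb{E}\{\hat{y}_l(\pi^*[r])\} - \sum_{m=1}^{M} G_m[r] \mathbb{E}\{\hat{x}_m(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} \]

where $\pi^*[r]$ is any i.i.d. algorithm. Plugging the above two
inequalities into the right-hand-side of (70) yields:

\[ \Delta(Q[r]) - V \mathbb{E}\{T[r] \phi(\gamma[r]) \,|\, Q[r]\} \leq D - \left[ V \phi(\gamma^*) - \sum_{m=1}^{M} G_m[r] \gamma_m^* \right] \mathbb{E}\{T[r] \,|\, Q[r]\} + C \mathbb{E}\{T[r] \,|\, Q[r]\} + \mathbb{E}\{T[r] \,|\, Q[r]\} \times \left[ \sum_{l=1}^{L} Z_l[r] \frac{\mathbb{E}\{\hat{y}_l(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} - \sum_{m=1}^{M} G_m[r] \frac{\mathbb{E}\{\hat{x}_m(\pi^*[r])\}}{\mathbb{E}\{\hat{T}(\pi^*[r])\}} \right] - \sum_{l=1}^{L} Z_l[r] c_l \mathbb{E}\{T[r] \,|\, Q[r]\} \tag{71} \]

where $\gamma^*$ is any vector in $\mathcal{R}$, and $\pi^*[r]$ is any i.i.d. algorithm.
Plugging the vector $\gamma^*$ and the i.i.d. algorithm $\pi^*[r]$ from
(67)-(69) into the right-hand-side of (71) yields:

\[ \Delta(Q[r]) - V \mathbb{E}\{T[r] \phi(\gamma[r]) \,|\, Q[r]\} \leq D + C \mathbb{E}\{T[r] \,|\, Q[r]\} - V util^{opt} \mathbb{E}\{T[r] \,|\, Q[r]\} \tag{72} \]

It follows from the above that $\Delta(Q[r]) \leq F$ for some constant
$F$. Thus, we know all queues $Z_l[r]$ and $G_m[r]$ are mean rate
stable and rate stable [25]. This proves part (a) by an argument
similar to that given in the proof of Theorem 1.

To prove part (b), we have by taking expectations of (72):

\[ \mathbb{E}\{L(Q[r+1])\} - \mathbb{E}\{L(Q[r])\} - V \mathbb{E}\{T[r] \phi(\gamma[r])\} \leq D + C \mathbb{E}\{T[r]\} - V util^{opt} \mathbb{E}\{T[r]\} \]

Summing the above over $r \in \{0, \ldots, R-1\}$, dividing by $RV$,
recalling our assumption that $L(Q[0]) = 0$, removing the
non-negative term $\mathbb{E}\{L(Q[R])\}$, and rearranging terms yields:

\[ \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{T[r] \phi(\gamma[r])\} \geq util^{opt} \overline{T}[R] - \frac{D + C \overline{T}[R]}{V} \]

where we have used $\overline{T}[R] \triangleq \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{T[r]\}$. Dividing both
sides by $\overline{T}[R]$ and using the variation on Jensen's inequality
(Lemma 3) yields:

\[ \phi\left( \frac{\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{T[r] \gamma[r]\}}{\overline{T}[R]} \right) \geq util^{opt} - \frac{D}{V \overline{T}[R]} - \frac{C}{V} \tag{73} \]

The above holds for all integers $R > 0$.

However, we have from (41) (similar to derivation of (34)):

\[ \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\{T[r] \gamma[r]\} \leq \overline{x}[R] + \mathbb{E}\{G[R]\}/R \]

where the above vector inequality is taken entrywise. Using
this in (73) together with the fact that $\phi(\gamma)$ is entrywise
non-decreasing, we have:

\[ \phi\left( \frac{\overline{x}[R] + \mathbb{E}\{G[R]\}/R}{\overline{T}[R]} \right) \geq util^{opt} - \frac{D}{V \overline{T}[R]} - \frac{C}{V} \]

Using the fact that $\overline{T}[R] \geq T^{min}$ gives:

\[ \phi\left( \frac{\overline{x}[R] + \mathbb{E}\{G[R]\}/R}{\overline{T}[R]} \right) \geq util^{opt} - \frac{D}{V T^{min}} - \frac{C}{V} \tag{74} \]

Now recall that all queues $G_m[r]$ are mean rate stable, so
that:

\[ \lim_{R \rightarrow \infty} \frac{\mathbb{E}\{G_m[R]\}}{R} = 0 \]

Taking a liminf of (74) as $R \rightarrow \infty$ and using continuity of
$\phi(\cdot)$ proves part (b) of Theorem 2.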



Appendix G — Simulation Details and 
Deterministic Bounds 

Here we provide simulation details for the particular system
of Section VII. We also show that deterministic bounds on the
constraint violations are computable if the $I^{max}$ parameter is
chosen suitably large.

A. Simulation Details

The bisection algorithm for the above example was
implemented by computing $val(\theta)$ in (50), where the term for
each sample $\eta_w$ is found by observing $Z[r]$ and choosing
$l[r] \in \{1, \ldots, 5\}$ and $Idle[r] \in [0, I^{max}]$ to minimize the
following deterministic expression:

\[ -V qual_{l[r]}[r] + \sum_{l=1}^{5} Z_l[r][0.5 + P^{tran} T_l^{tran}[r] 1_{\{l[r]=l\}}] - \theta(0.5 + T^{tran}_{l[r]}[r] + Idle[r]) \]

This is solved by choosing $Idle[r] = 0$ whenever $\theta \leq 0$,
and $Idle[r] = I^{max}$ if $\theta > 0$, and choosing $l[r]$ as the index
$l \in \{1, \ldots, 5\}$ that minimizes:

\[ -V qual_l[r] + (Z_l[r] P^{tran} - \theta) T_l^{tran}[r] \]

The alternative algorithm with time averaging is
implemented by observing $Z[r]$ and $\theta[r]$ and minimizing (52),
which in this context amounts to choosing $l[r] \in \{1, \ldots, 5\}$,
$Idle[r] \in [0, I^{max}]$ to minimize:

\[ -V qual_{l[r]}[r] - V \theta[r](T^{tran}_{l[r]}[r] + Idle[r]) + \sum_{l=1}^{L} Z_l[r] \left[ P^{tran} T_l^{tran}[r] 1_{\{l[r]=l\}} - c_l (T^{tran}_{l[r]}[r] + Idle[r]) \right] \]

This amounts to choosing $Idle[r] = 0$ whenever
$V \theta[r] + \sum_{l=1}^{L} Z_l[r] c_l \leq 0$, and $Idle[r] = I^{max}$ else, and choosing
$l[r]$ as the index $l \in \{1, \ldots, 5\}$ that minimizes:

\[ -V qual_l[r] - T_l^{tran}[r] \left[ V \theta[r] - Z_l[r] P^{tran} + \sum_{k=1}^{L} Z_k[r] c_k \right] \]



B. Deterministic Queue Bounds 

Here we show that, if $I^{max}$ is chosen to be suitably large,
then the drift-plus-penalty ratio algorithm for this context
yields deterministic bounds on $Z_l[r]$. The ratio to minimize
is:

\[ \frac{\mathbb{E}\left\{ -V qual_{l[r]}[r] + \sum_{l=1}^{L} Z_l[r] \hat{y}_l(\pi[r]) \,|\, Z[r] \right\}}{\mathbb{E}\left\{ 0.5 + T^{tran}_{l[r]}[r] + Idle[r] \,|\, Z[r] \right\}} \]

Because $\hat{y}_l(\pi[r]) \geq 0.5$ for all $l$ and all policy choices, and
$-V qual_{l[r]}[r] \geq -5V$ for all policy choices, the numerator
above is positive whenever:

\[ \sum_{l=1}^{L} Z_l[r](0.5) - 5V > 0 \]

In particular, the algorithm chooses $Idle[r] = I^{max}$ whenever
$Z_l[r] > 10V$ for any queue $l \in \{1, \ldots, L\}$.
Recall that the $Z_l[r]$ update is given by:

\[ Z_l[r+1] = \max[Z_l[r] + \hat{y}_l(\pi[r]) - 0.25 \hat{T}(\pi[r]), 0] \]

Because $\hat{y}_l(\pi[r]) \leq 0.5 + 2.5 P^{tran}$, and $\hat{T}(\pi[r]) \geq 1.0 + Idle[r]$,
whenever $Idle[r] = I^{max}$ we have:

\[ Z_l[r+1] \leq \max[Z_l[r] + 0.5 + 2.5 P^{tran} - 0.25(1.0 + I^{max}), 0] \]

Therefore, $Z_l[r]$ cannot increase on the next frame if
$Idle[r] = I^{max}$ and if:

\[ I^{max} \geq \frac{2.5 P^{tran} - 0.25 + 0.5}{0.25} = 1.0 + 10 P^{tran} \]

This means that if any queue $Z_l[r]$ exceeds $10V$, then
$Idle[r] = I^{max}$ and it cannot increase further. Because the
maximum increase in $Z_l[r]$ is $0.5 + 2.5 P^{tran} - 0.25$, for all
$l \in \{1, \ldots, 5\}$ and all frames $r$ we have:

\[ 0 \leq Z_l[r] \leq 10V + 0.5 + 2.5 P^{tran} - 0.25 = d_1 V + d_2 \]

provided that this inequality holds for $r = 0$, where $d_1 \triangleq 10$
and $d_2 \triangleq 0.25 + 2.5 P^{tran}$. Thus, from (33) we have for all
$l \in \{1, \ldots, L\}$ and all integers $R > 0$:

\[ \frac{\sum_{r=0}^{R-1} y_l[r]}{\sum_{r=0}^{R-1} T[r]} \leq c_l + \frac{Z_l[R]}{\sum_{r=0}^{R-1} T[r]} \leq c_l + \frac{d_1 V + d_2}{R} \tag{75} \]

where we have used the fact that $T[r] \geq 1.0$ for all $r$. This
provides a deterministic guarantee on the worst-case constraint
violation over any interval of frames starting at frame 0. The
deviation is $O(1/R)$, which decays at a faster rate than the
general $O(\sqrt{1/R})$ bound in (25).
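The deterministic queue bound can be exercised with a small single-queue simulation. This is a stress-test sketch under stated assumptions rather than the paper's simulation: only the rule "choose $Idle[r] = I^{max}$ whenever the queue exceeds $10V$" is taken from the argument above, while the behavior below the threshold (zero idle time, device transmitting on 90% of frames) and the function name `simulate_queue` are arbitrary illustrative choices meant to push the queue upward.

```python
import random


def simulate_queue(V, P_tran, I_max, frames, seed=0):
    """Track the largest value reached by one virtual queue Z_l."""
    rng = random.Random(seed)
    Z, Z_peak = 0.0, 0.0
    for _ in range(frames):
        # Rule from the text: full idle time once the queue exceeds 10V.
        # Below the threshold we ASSUME zero idle time (a stress case).
        idle = I_max if Z > 10 * V else 0.0
        t_tran = rng.uniform(0.5, 2.5)       # transmission time this frame
        transmits = rng.random() < 0.9        # assumed: device is usually chosen
        y = 0.5 + (P_tran * t_tran if transmits else 0.0)  # energy this frame
        T = 0.5 + t_tran + idle               # frame duration
        Z = max(Z + y - 0.25 * T, 0.0)        # queue update with c_l = 0.25
        Z_peak = max(Z_peak, Z)
    return Z_peak


V, P_tran = 100.0, 1.0
I_max = 1.0 + 10.0 * P_tran                   # smallest "suitably large" value
d1, d2 = 10.0, 0.25 + 2.5 * P_tran
Z_peak = simulate_queue(V, P_tran, I_max, frames=5000)
bound = d1 * V + d2
```

With $P^{tran} = 1.0$ the threshold evaluates to $I^{max} = 11.0$, matching the value quoted in Section VII, and the bound is $d_1 V + d_2 = 1002.75$ for $V = 100$.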

Appendix H — Convergence Under a Slater 
Condition 

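We first present a definition and two theorems from [25]. Define $\mathcal{H}[0] \triangleq \boldsymbol{Z}[0]$, and for integers $r > 0$ define $\mathcal{H}[r]$ as the system history up to frame $r$, being the queue values up to and including frame $r$, and the penalties up to but not including frame $r$:

$$\mathcal{H}[r] \triangleq \{\boldsymbol{Z}[0], \boldsymbol{Z}[1], \ldots, \boldsymbol{Z}[r], \boldsymbol{y}[0], \boldsymbol{y}[1], \ldots, \boldsymbol{y}[r-1]\}$$

Define $L(\boldsymbol{Z}[r])$ by:

$$L(\boldsymbol{Z}[r]) \triangleq \frac{1}{2}\sum_{l=1}^{L} Z_l[r]^2 \qquad (76)$$

Define $\Delta(\mathcal{H}[r])$ by:

$$\Delta(\mathcal{H}[r]) \triangleq E\left\{L(\boldsymbol{Z}[r+1]) - L(\boldsymbol{Z}[r]) \mid \mathcal{H}[r]\right\}$$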
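To make the Lyapunov function and the one-frame change concrete, here is a small sketch. It assumes the general virtual queue update $Z_l[r+1] = \max[Z_l[r] + y_l[r] - c_l T[r], 0]$, and it evaluates the standard sample-path bound $L(\boldsymbol{Z}[r+1]) - L(\boldsymbol{Z}[r]) \leq \frac{1}{2}\sum_l (y_l - c_l T)^2 + \sum_l Z_l (y_l - c_l T)$, which follows from $\max[a, 0]^2 \leq a^2$. That bound is a well-known fact and is not stated in this appendix; the function names and numbers are hypothetical.

```python
def lyapunov(z):
    """Quadratic Lyapunov function L(Z) = (1/2) * sum_l Z_l^2."""
    return 0.5 * sum(q * q for q in z)


def queue_update(z, y, c, t):
    """Assumed virtual queue update: Z_l <- max[Z_l + y_l - c_l * T, 0]."""
    return [max(q + yl - cl * t, 0.0) for q, yl, cl in zip(z, y, c)]


def drift_upper_bound(z, y, c, t):
    """Sample-path bound on L(Z_next) - L(Z), using max[a, 0]^2 <= a^2."""
    total = 0.0
    for q, yl, cl in zip(z, y, c):
        a = yl - cl * t              # net queue input for this frame
        total += 0.5 * a * a + q * a
    return total


# Illustrative values (assumptions, not taken from the paper).
z = [2.0, 0.1, 5.0]
y = [1.0, 0.2, 3.0]
c = [0.25, 0.25, 0.25]
t = 4.0
z_next = queue_update(z, y, c, t)
change = lyapunov(z_next) - lyapunov(z)
print(change <= drift_upper_bound(z, y, c, t))
```

Taking a conditional expectation of the one-frame change given the history is what produces the drift $\Delta(\mathcal{H}[r])$ used in the theorems that follow.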

Suppose that per-frame changes in the $Z_l[r]$ queues have bounded conditional fourth moments, regardless of past history, so that there is a constant $D > 0$ such that for all $l \in \{1, \ldots, L\}$, all frames $r$, and all possible $\mathcal{H}[r]$ we have:

$$E\left\{(Z_l[r+1] - Z_l[r])^4 \mid \mathcal{H}[r]\right\} \leq D \qquad (77)$$



Theorem 5: (from [25]) Suppose $L(\boldsymbol{Z}[r])$ is the quadratic Lyapunov function in (76), and that per-frame changes in the queues $Z_l[r]$ have conditional bounded fourth moments, so that (77) holds. Suppose that $E\{L(\boldsymbol{Z}[0])\} < \infty$, and that there are constants $B > 0$, $\epsilon > 0$ such that for all $r$ and all possible $\mathcal{H}[r]$, we have:

$$\Delta(\mathcal{H}[r]) \leq B - \epsilon\sum_{l=1}^{L} Z_l[r]$$

Then for all $l \in \{1, \ldots, L\}$ we have:

$$\sum_{r=1}^{\infty} \frac{E\{Z_l[r]^2\}}{r^2} < \infty, \qquad \limsup_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} Z_l[r] \leq B/\epsilon \quad (w.p.1)$$



Theorem 6: (from [25]) Suppose $L(\boldsymbol{Z}[r])$ is the quadratic Lyapunov function in (76), and that per-frame changes in the queues $Z_l[r]$ have conditional bounded fourth moments, so that (77) holds. Suppose that $\beta[r]$ is some additional process related to the system, and that $\beta[0]$ and $L(\boldsymbol{Z}[0])$ are finite with probability 1. Suppose that:

$$\sum_{r=1}^{\infty} \frac{E\{\beta[r]^2 + Z_l[r]^2\}}{r^2} < \infty$$

If there are constants $V > 0$, $B > 0$, $\beta^*$ such that for all frames $r$ and all possible $\mathcal{H}[r]$ we have:

$$\Delta(\mathcal{H}[r]) + V E\{\beta[r] \mid \mathcal{H}[r]\} \leq B + V\beta^* \qquad (78)$$

Then:

$$\limsup_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} \beta[r] \leq \beta^* + B/V \quad (w.p.1)$$

We now prove the following result. 

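Theorem 7: Suppose the same assumptions as in Theorem 1 hold, and for simplicity assume that $C = 0$. Additionally assume the fourth moment boundedness condition (77) holds, that second moments of $y_0[r]$ are bounded by the same constant for all $r$, and that there exists an $\epsilon > 0$ and an i.i.d. algorithm $\pi^*[r]$ such that:

$$\frac{E\{\hat{y}_l(\pi^*[r])\}}{E\{\hat{T}(\pi^*[r])\}} \leq c_l - \epsilon \quad \forall l \in \{1, \ldots, L\} \qquad (79)$$

Then:

(a) We have:

$$\limsup_{R\to\infty} \frac{1}{R}\sum_{r=0}^{R-1} \left[y_0[r] - T[r]\,\mathit{ratio}^{opt}\right] \leq \frac{B}{V} \quad (w.p.1)$$

(b) We have:

$$\limsup_{R\to\infty} \frac{\sum_{r=0}^{R-1} y_0[r]}{\sum_{r=0}^{R-1} T[r]} \leq \mathit{ratio}^{opt} + \frac{B}{V T_{min}} \quad (w.p.1)$$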

Proof: (Theorem 7 part (a)) Use of the history-based drift $\Delta(\mathcal{H}[r])$ is required for the above theorems. However, manipulations with this drift are almost the same, and the same proof of Theorem 1 can be repeated to line (27) to show that if the conditions of the theorem hold, then (compare with (27) and note that $C = 0$):

$$\Delta(\mathcal{H}[r]) + V E\{\hat{y}_0(\pi[r]) \mid \mathcal{H}[r]\} \leq B + E\{\hat{T}(\pi[r]) \mid \mathcal{H}[r]\}\, V\,\mathit{ratio}^{opt}$$

Rearranging terms yields:

$$\Delta(\mathcal{H}[r]) + V E\{\hat{y}_0(\pi[r]) - \hat{T}(\pi[r])\,\mathit{ratio}^{opt} \mid \mathcal{H}[r]\} \leq B \qquad (80)$$

This is the same as condition (78) with $\beta[r] \triangleq \hat{y}_0(\pi[r]) - \hat{T}(\pi[r])\,\mathit{ratio}^{opt}$ and $\beta^* \triangleq 0$. By Theorem 6 it follows that if the fourth moment boundedness condition (77) holds, and if:

$$\sum_{r=1}^{\infty} \frac{E\{\beta[r]^2 + Z_l[r]^2\}}{r^2} < \infty \qquad (81)$$

then we can conclude the result of part (a).
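It suffices to show that (81) holds. However, since we know that second moments of $y_0[r]$ and $T[r]$ are bounded by the same finite constant for all $r$, it is easy to see that:

$$\sum_{r=1}^{\infty} \frac{E\{\beta[r]^2\}}{r^2} < \infty$$

It therefore suffices to show that $\sum_{r=1}^{\infty} E\{Z_l[r]^2\}/r^2 < \infty$. To this end, the proof of Theorem 1 can be repeated to line (26) to show that (compare with (26)):

$$\Delta(\mathcal{H}[r]) + V E\{\hat{y}_0(\pi[r]) \mid \mathcal{H}[r]\} \leq B + E\{\hat{T}(\pi[r]) \mid \mathcal{H}[r]\}\left[C + \frac{E\left\{V\hat{y}_0(\pi^*[r]) + \sum_{l=1}^{L} Z_l[r]\,\hat{y}_l(\pi^*[r])\right\}}{E\{\hat{T}(\pi^*[r])\}}\right] - \sum_{l=1}^{L} Z_l[r]\, c_l\, E\{\hat{T}(\pi[r]) \mid \mathcal{H}[r]\} \qquad (82)$$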

where $\pi^*[r]$ is from any i.i.d. algorithm. Plugging (79) into the right-hand-side of (82) yields:

$$\Delta(\mathcal{H}[r]) + V E\{\hat{y}_0(\pi[r]) \mid \mathcal{H}[r]\} \leq B + E\{\hat{T}(\pi[r]) \mid \mathcal{H}[r]\}\left[C + \frac{E\{V\hat{y}_0(\pi^*[r])\}}{E\{\hat{T}(\pi^*[r])\}}\right] - \epsilon\, E\{\hat{T}(\pi[r]) \mid \mathcal{H}[r]\}\sum_{l=1}^{L} Z_l[r]$$

In particular, because all conditional expectations are assumed to be upper and lower bounded (and because $T_{min} > 0$), we have:

$$\Delta(\mathcal{H}[r]) \leq B_1 - \epsilon\, T_{min}\sum_{l=1}^{L} Z_l[r]$$

where $B_1$ is a positive constant. It follows by Theorem 5 that $\sum_{r=1}^{\infty} E\{Z_l[r]^2\}/r^2 < \infty$ for all $l \in \{1, \ldots, L\}$. □
Proof: (Theorem 7 part (b)) First note that if the limsup of a function $f[r]$ is upper bounded by a positive constant, then the limsup of $[f[r]]^{+}$ is bounded by the same constant, where $[f[r]]^{+} \triangleq \max[f[r], 0]$. Thus, by part (a) we have:

$$\limsup_{R\to\infty} \left[\frac{1}{R}\sum_{r=0}^{R-1}\left(y_0[r] - T[r]\,\mathit{ratio}^{opt}\right)\right]^{+} \leq \frac{B}{V} \quad (w.p.1)$$



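Next note that because $E\{T[r] \mid \mathcal{H}[r]\} \geq T_{min} > 0$ for all $r$ and all $\mathcal{H}[r]$, and second moments of $T[r]$ are bounded by the same finite constant for all $r$, it can be shown that [25]:

$$\limsup_{R\to\infty} \frac{R}{\sum_{r=0}^{R-1} T[r]} \leq 1/T_{min} \quad (w.p.1)$$

We then have:

$$\frac{\sum_{r=0}^{R-1} y_0[r]}{\sum_{r=0}^{R-1} T[r]} - \mathit{ratio}^{opt} \leq \frac{R}{\sum_{r=0}^{R-1} T[r]}\left[\frac{1}{R}\sum_{r=0}^{R-1}\left(y_0[r] - T[r]\,\mathit{ratio}^{opt}\right)\right]^{+}$$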



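Taking limsups of the above and using the fact that the limsup of a product of non-negative functions is less than or equal to the product of the limsups yields:

$$\limsup_{R\to\infty}\left[\frac{\sum_{r=0}^{R-1} y_0[r]}{\sum_{r=0}^{R-1} T[r]} - \mathit{ratio}^{opt}\right] \leq \frac{1}{T_{min}}\,\frac{B}{V} \quad (w.p.1)$$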



□ 
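The algebraic step behind part (b) can be checked on any finite sample path. The sketch below is only an illustration under assumed inputs: the function name `ratio_gap_and_bound` and the sample values are hypothetical. It computes the gap $\sum_r y_0[r] / \sum_r T[r] - \mathit{ratio}^{opt}$ and the bounding product $(R / \sum_r T[r])\,[\frac{1}{R}\sum_r (y_0[r] - T[r]\,\mathit{ratio}^{opt})]^{+}$ used in the proof.

```python
def ratio_gap_and_bound(y0, t, ratio_opt):
    """Return (gap, bound) for the inequality used in part (b).

    gap   = sum(y0) / sum(t) - ratio_opt
    bound = (R / sum(t)) * max(average of (y0[r] - t[r] * ratio_opt), 0)
    Assumes every frame length in t is positive.
    """
    num_frames = len(t)
    total_t = sum(t)
    gap = sum(y0) / total_t - ratio_opt
    avg = sum(y - tt * ratio_opt for y, tt in zip(y0, t)) / num_frames
    bound = (num_frames / total_t) * max(avg, 0.0)
    return gap, bound


# Illustrative sample path (assumed values, not from the paper).
y0 = [1.0, 3.0, 2.0]
t = [1.0, 2.0, 1.0]
for ratio_opt in (1.0, 2.0):
    gap, bound = ratio_gap_and_bound(y0, t, ratio_opt)
    print(gap <= bound + 1e-12)
```

The two sides coincide whenever the averaged term is non-negative, and the bound is zero otherwise, which is why the $[\cdot]^{+}$ operation keeps both factors of the product non-negative before limsups are taken.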